# Streaming Responses
Stream responses in real-time for better user experience.
Streaming allows you to receive responses progressively as they are generated, rather than waiting for the complete response. This creates a more responsive user experience, especially for longer outputs.
- Faster perceived response time - Users see output immediately
- Better UX for long responses - Progress is visible
- Lower memory usage - Process chunks as they arrive
## Basic Streaming

### Enable Streaming

Set `stream: true` in your request:
```javascript
import ModelPilot from 'modelpilot';

const client = new ModelPilot({
  apiKey: process.env.MODELPILOT_API_KEY,
  routerId: process.env.MODELPILOT_ROUTER_ID,
});

async function streamResponse() {
  const stream = await client.chat.completions.create({
    messages: [
      { role: 'user', content: 'Write a short story about a robot' }
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }

  console.log('\n\nStream complete!');
}

streamResponse();
```

## Stream Chunk Format
### Chunk Structure

Each chunk follows this format:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1677858242,
  "model": "gpt-5",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Hello"  // Incremental content
      },
      "finish_reason": null
    }
  ]
}

// The last chunk includes a finish_reason
{
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}
```

## Token Usage Tracking
### Get Token Counts in Streaming

Request usage data with `stream_options` (OpenAI-compatible).

When streaming, you can request token usage data by setting `stream_options: { include_usage: true }`. This includes a final chunk with usage information before the stream completes.
```javascript
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
  stream_options: {
    include_usage: true, // Request token usage
  },
});

let totalContent = '';

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  totalContent += content;

  // Check for usage data in the final chunk
  if (chunk.usage) {
    console.log('Token usage:', {
      prompt_tokens: chunk.usage.prompt_tokens,
      completion_tokens: chunk.usage.completion_tokens,
      total_tokens: chunk.usage.total_tokens,
    });
  }
}
```

### Usage Chunk Format
The final chunk contains usage data with an empty `choices` array:
```json
// Regular chunks (content)
{
  "id": "chatcmpl-abc123",
  "choices": [
    {
      "index": 0,
      "delta": { "content": "Hello" },
      "finish_reason": null
    }
  ]
}

// Final usage chunk (appears after finish_reason: "stop")
{
  "id": "chatcmpl-abc123",
  "choices": [],  // Empty choices array
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 150,
    "total_tokens": 160
  }
}
```

**Note:** The usage chunk has an empty `choices` array. Check for `chunk.usage` to detect it.
## React Integration

### React Component

Display streaming responses in React:
```javascript
import { useState } from 'react';
import ModelPilot from 'modelpilot';

const client = new ModelPilot({
  apiKey: process.env.MODELPILOT_API_KEY,
  routerId: process.env.MODELPILOT_ROUTER_ID,
});

export default function ChatComponent() {
  const [response, setResponse] = useState('');
  const [isLoading, setIsLoading] = useState(false);

  async function handleSubmit(userMessage) {
    setIsLoading(true);
    setResponse('');

    try {
      const stream = await client.chat.completions.create({
        messages: [{ role: 'user', content: userMessage }],
        stream: true,
      });

      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        setResponse(prev => prev + content);
      }
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsLoading(false);
    }
  }

  return (
    <div>
      <div className="response">
        {response}
        {isLoading && <span className="cursor">|</span>}
      </div>
    </div>
  );
}
```

## Server-Sent Events (SSE)
### Direct API Usage

If you're not using the SDK, handle the SSE stream manually:
```javascript
const response = await fetch('https://modelpilot.co/api/router/{routerId}/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${API_KEY}`,
  },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // SSE events can be split across network chunks, so buffer partial lines
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep the last (possibly incomplete) line

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = line.slice(6);
      if (data === '[DONE]') continue;

      const parsed = JSON.parse(data);
      const content = parsed.choices[0]?.delta?.content || '';
      console.log(content);
    }
  }
}
```

## Error Handling
### Handling Stream Errors
```javascript
async function streamWithErrorHandling() {
  try {
    const stream = await client.chat.completions.create({
      messages: [{ role: 'user', content: 'Hello!' }],
      stream: true,
    });

    for await (const chunk of stream) {
      try {
        const content = chunk.choices[0]?.delta?.content || '';
        process.stdout.write(content);
      } catch (chunkError) {
        console.error('Error processing chunk:', chunkError);
        // Continue processing other chunks
      }
    }
  } catch (error) {
    if (error.status === 429) {
      console.error('Rate limit exceeded');
    } else if (error.status === 503) {
      console.error('Service temporarily unavailable');
    } else {
      console.error('Streaming error:', error.message);
    }
  }
}
```
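The example above reports rate-limit and availability errors but does not retry. As a minimal sketch, a streaming call can be wrapped in retries with exponential backoff; it assumes the SDK exposes an HTTP `status` on thrown errors (as in the example above), and `streamWithRetry`, `maxRetries`, and the backoff delays are illustrative names and values:

```javascript
// Sketch: retry a streaming request on transient errors (429 / 503)
// with exponential backoff. maxRetries and the delays are example values.
async function streamWithRetry(messages, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const stream = await client.chat.completions.create({ messages, stream: true });
      for await (const chunk of stream) {
        process.stdout.write(chunk.choices[0]?.delta?.content || '');
      }
      return; // success
    } catch (error) {
      const retryable = error.status === 429 || error.status === 503;
      if (!retryable || attempt === maxRetries) throw error;

      const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      console.warn(`Attempt ${attempt + 1} failed (${error.status}), retrying in ${delayMs}ms`);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

Retrying restarts generation from the beginning, so clear any partial content already shown before the next attempt.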
## Best Practices

### When to Use Streaming
- ✓ Long-form content generation
- ✓ Interactive chat applications
- ✓ Real-time code generation
- ✓ Stories, articles, or creative writing
### When NOT to Use Streaming
- ✗ JSON/structured output parsing
- ✗ Batch processing
- ✗ Function calling responses
- ✗ Short responses (overhead not worth it)
### Performance Tips
- Buffer chunks for UI updates (e.g., every 50ms) to avoid excessive re-renders (a sketch follows this list)
- Implement timeout handling for long-running streams (see Cancelling Streams below)
- Handle connection drops gracefully with retry logic (see the retry sketch under Error Handling)
- Use abort controllers to cancel streams when needed
- Monitor memory usage when accumulating large responses
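The buffering tip is worth showing concretely. The sketch below reuses the ModelPilot client from the earlier examples; `streamBuffered`, `render`, and `flushMs` are hypothetical names introduced for illustration (for example, `render` could be a React state setter):

```javascript
// Sketch: accumulate streamed deltas and flush them to the UI on a fixed
// interval instead of updating on every chunk. `render` is a hypothetical
// callback that receives the full text so far (e.g. a React state setter).
async function streamBuffered(client, userMessage, render, flushMs = 50) {
  let full = '';
  let dirty = false;

  const flusher = setInterval(() => {
    if (dirty) {
      render(full); // at most one UI update per interval
      dirty = false;
    }
  }, flushMs);

  try {
    const stream = await client.chat.completions.create({
      messages: [{ role: 'user', content: userMessage }],
      stream: true,
    });

    for await (const chunk of stream) {
      full += chunk.choices[0]?.delta?.content || '';
      dirty = true;
    }
  } finally {
    clearInterval(flusher);
    render(full); // final flush
  }

  return full;
}
```

Flushing on an interval keeps the UI responsive while capping updates at roughly one per `flushMs`.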
## Cancelling Streams

### Using AbortController
```javascript
const controller = new AbortController();

async function cancellableStream() {
  try {
    const stream = await client.chat.completions.create({
      messages: [{ role: 'user', content: 'Long response...' }],
      stream: true,
    }, {
      signal: controller.signal,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      console.log(content);
    }
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Stream cancelled by user');
    }
  }
}

// Cancel the stream
controller.abort();
```
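The same `AbortController` can be wired to a deadline so a stalled stream cancels itself, which covers the timeout tip from Best Practices. This is a minimal sketch; the 30-second deadline is an arbitrary example value:

```javascript
// Sketch: abort the stream automatically if it exceeds a deadline.
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30_000); // example: 30s deadline

try {
  const stream = await client.chat.completions.create({
    messages: [{ role: 'user', content: 'Long response...' }],
    stream: true,
  }, {
    signal: controller.signal,
  });

  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || '');
  }
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Stream timed out and was cancelled');
  }
} finally {
  clearTimeout(timeout); // stop the timer once the stream finishes or aborts
}
```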