Tutorial · 6 min read

Build a Real-Time AI Chat in Next.js with Streaming Responses (Step by Step)

Youness Haji

January 5, 2025

Every AI chat tutorial shows you the same thing: send a message, wait for the full response, display it. That's not how ChatGPT works. That's not how users expect AI to work. They expect streaming — tokens appearing in real time, word by word.

Here's how to build a production-ready AI chat with streaming in Next.js, including the parts tutorials usually skip: error handling, rate limiting, and a polished UI.

What Streaming Means and Why It Matters

Without streaming, your AI chat flow looks like this:

  1. User sends message
  2. Spinner for 3-5 seconds
  3. Full response appears at once

With streaming:

  1. User sends message
  2. First words appear in 100ms
  3. Response types out in real time

The perceived latency drops from seconds to milliseconds. Users feel like the AI is "thinking" alongside them rather than disappearing into a black box.

Setting Up the API Route

We'll use Server-Sent Events (SSE) for streaming. Create an API route:

```typescript
// app/api/chat/stream/route.ts
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

export async function POST(request: Request) {
  const { message } = await request.json();

  const stream = await groq.chat.completions.create({
    model: 'llama3-70b-8192',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: message },
    ],
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content || '';
          if (text) {
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
            );
          }
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ error: 'Stream failed' })}\n\n`)
        );
        controller.close();
      }
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```

Key points:

  • stream: true tells Groq to return chunks instead of a complete response
  • We use ReadableStream to pipe chunks to the client as SSE events
  • The [DONE] sentinel tells the client the stream is complete
  • Error handling inside the stream prevents hanging connections
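On the wire, the route produces a sequence of SSE events, each a "data:" line followed by a blank line. For a two-token reply, the raw response body looks roughly like this:

```text
data: {"text":"Hello"}

data: {"text":" world"}

data: [DONE]
```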

Building the Chat UI Component

The client needs to consume the SSE stream and render tokens as they arrive:

```typescript
// hooks/useStreamingChat.ts
import { useState, useCallback } from 'react';

export function useStreamingChat() {
  const [messages, setMessages] = useState<Array<{
    role: 'user' | 'assistant';
    content: string;
  }>>([]);
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (content: string) => {
    setMessages((prev) => [...prev, { role: 'user', content }]);
    setMessages((prev) => [...prev, { role: 'assistant', content: '' }]);
    setIsStreaming(true);

    try {
      const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message: content }),
      });

      const reader = response.body?.getReader();
      const decoder = new TextDecoder();

      if (!reader) throw new Error('No reader available');

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // stream: true buffers multi-byte characters that are split
        // across network chunks instead of emitting replacement chars
        const chunk = decoder.decode(value, { stream: true });
        const lines = chunk.split('\n').filter((l) => l.startsWith('data: '));

        for (const line of lines) {
          const data = line.replace('data: ', '');
          if (data === '[DONE]') break;

          try {
            const parsed = JSON.parse(data);
            if (parsed.text) {
              setMessages((prev) => {
                const updated = [...prev];
                const last = updated[updated.length - 1];
                updated[updated.length - 1] = {
                  ...last,
                  content: last.content + parsed.text,
                };
                return updated;
              });
            }
          } catch {
            // Skip malformed chunks
          }
        }
      }
    } catch (error) {
      setMessages((prev) => {
        const updated = [...prev];
        updated[updated.length - 1] = {
          role: 'assistant',
          content: 'Sorry, something went wrong. Please try again.',
        };
        return updated;
      });
    } finally {
      setIsStreaming(false);
    }
  }, []);

  return { messages, sendMessage, isStreaming };
}
```
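One weakness of the loop above: a "data:" line can be split across two network chunks, in which case JSON.parse fails and that token is silently dropped. The parsing logic can be factored into a small buffering helper that carries partial lines across chunks (the names here are my own, not from the hook):

```typescript
// Parses SSE "data:" lines from raw text chunks, carrying any trailing
// partial line over to the next call so split tokens aren't lost.
export function createSSEParser() {
  let buffer = '';
  return function parse(chunk: string): { texts: string[]; done: boolean } {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the incomplete trailing line

    const texts: string[] = [];
    let done = false;
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice('data: '.length);
      if (data === '[DONE]') {
        done = true;
        continue;
      }
      try {
        const parsed = JSON.parse(data);
        if (parsed.text) texts.push(parsed.text);
      } catch {
        // Skip malformed payloads
      }
    }
    return { texts, done };
  };
}
```

Inside the hook, each `decoder.decode(...)` result would go through `parse()`, and the returned texts get appended to the assistant message as before.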

Handling Errors Gracefully

Production AI chat needs to handle three failure modes:

1. Network Errors

The fetch can fail entirely. Wrap it in try/catch and show a user-friendly message.

2. Rate Limiting

Groq and OpenAI both rate limit. Check for 429 responses and show a "please wait" message:

```typescript
if (response.status === 429) {
  setMessages((prev) => {
    const updated = [...prev];
    updated[updated.length - 1] = {
      role: 'assistant',
      content: 'I\'m receiving too many requests right now. Please try again in a few seconds.',
    };
    return updated;
  });
  return;
}
```

3. Partial Stream Failures

The stream can start successfully but fail midway. The error handler inside ReadableStream.start() catches this and sends an error event so the client knows to stop waiting.
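The hook above only reads parsed.text, so the server's error event would currently be ignored. A minimal helper (my own naming) that also checks parsed.error lets a mid-stream failure bubble up to the hook's outer try/catch:

```typescript
// Extracts the text from a parsed SSE payload, or throws if the server
// reported a mid-stream failure so the hook's catch block can handle it.
export function textFromEvent(parsed: { text?: string; error?: string }): string {
  if (parsed.error) {
    throw new Error(parsed.error); // bubbles to the hook's catch block
  }
  return parsed.text ?? '';
}
```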

Rate Limiting on Your End

Don't rely solely on the AI provider's rate limits. Implement your own:

```typescript
// Simple in-memory rate limiter
const rateLimit = new Map<string, number[]>();
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 10;

function isRateLimited(ip: string): boolean {
  const now = Date.now();
  const timestamps = rateLimit.get(ip) || [];
  const recent = timestamps.filter((t) => now - t < WINDOW_MS);
  rateLimit.set(ip, recent);

  if (recent.length >= MAX_REQUESTS) return true;
  recent.push(now);
  return false;
}
```
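Wiring the limiter into the route is a short guard before the Groq call. A sketch (the limiter is repeated so the snippet stands alone; x-forwarded-for is the header Vercel sets with the client IP, and the guard function name is my own):

```typescript
// In-memory sliding-window limiter, as above.
const rateLimit = new Map<string, number[]>();
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 10;

export function isRateLimited(ip: string): boolean {
  const now = Date.now();
  const recent = (rateLimit.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  rateLimit.set(ip, recent);
  if (recent.length >= MAX_REQUESTS) return true;
  recent.push(now);
  return false;
}

// Called at the top of the POST handler: returns a 429 Response for
// over-limit clients, or null to proceed to the model call.
export function guard(request: Request): Response | null {
  const ip = request.headers.get('x-forwarded-for') ?? 'unknown';
  if (isRateLimited(ip)) {
    return new Response(JSON.stringify({ error: 'Too many requests' }), {
      status: 429,
      headers: { 'Content-Type': 'application/json' },
    });
  }
  return null;
}
```

Note that an in-memory map resets on every deployment and isn't shared across serverless instances; for stricter guarantees you'd back it with something like Redis.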

The Typewriter Effect

For polish, add a CSS animation that makes the streaming text feel like typing:

```css
@keyframes blink {
  0%, 50% { opacity: 1; }
  51%, 100% { opacity: 0; }
}

.streaming-cursor::after {
  content: '▋';
  animation: blink 1s infinite;
  color: #00F5FF;
}
```

Add the streaming-cursor class to the last message while isStreaming is true. Remove it when the stream completes.
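In the message list renderer, that conditional can live in a tiny helper (a sketch with my own names; 'message' stands in for whatever base class your list items use):

```typescript
// Returns the CSS classes for the message at `index` out of `total`.
// Only the last message shows the cursor, and only while streaming.
export function messageClassName(
  index: number,
  total: number,
  isStreaming: boolean
): string {
  const isLast = index === total - 1;
  return isStreaming && isLast ? 'message streaming-cursor' : 'message';
}
```

Used as `className={messageClassName(i, messages.length, isStreaming)}` inside the map over messages, the cursor disappears automatically when isStreaming flips to false.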

Deploying to Vercel

Two things to configure:

  1. Environment variables: Add GROQ_API_KEY in the Vercel dashboard
  2. Runtime: long-lived SSE connections and the groq-sdk client are most reliable on the Node.js runtime, not Edge. Add to your route:
```typescript
export const runtime = 'nodejs';
```

Edge runtime doesn't support streaming with all providers. Stick with Node.js for reliability.

The Complete Flow

  1. User types a message and hits send
  2. Message appears in the chat UI immediately
  3. Empty assistant message placeholder appears with blinking cursor
  4. Fetch POST to /api/chat/stream
  5. API creates Groq streaming completion
  6. SSE chunks flow back to the client
  7. Each chunk appends to the assistant message in real time
  8. [DONE] event removes the cursor and enables the input
  9. Error at any point shows a friendly message

Conclusion

Streaming transforms AI chat from a "request-response" interaction into a conversation. The implementation is more complex than a simple fetch, but the UX improvement is dramatic.

The code above is production-tested — it's the foundation of the AI chat on my portfolio. Every response you see there streams in real time using this exact pattern.


Building a custom AI chat for your product? I've done it multiple times and can help you ship faster. Let's talk.