Back to Blog
Engineering

1-Second Latency: How We Built a Real-Time AI Teleprompter That Grew Clientele by 40%

The Problem

At Oculus / DSHI in Switzerland, we had an unusual challenge: build an AI agent that acts as a real-time teleprompter for customer communications. Not a chatbot. Not a pre-written script. A system that listens to live conversations and feeds contextually relevant suggestions to the representative — with a latency budget of one second.

One second. From the moment a customer finishes a sentence to the moment our agent surfaces a suggested response on the rep's screen. Any slower, and the conversation moves on. The suggestion becomes irrelevant.

I led a team of 3 engineers to build this from scratch. Here's the architecture, the mistakes, and the results.

Why 1-Second Matters

Human conversation has a natural rhythm. Research shows that response gaps longer than 700ms feel unnatural to the listener. Our reps needed time to read the suggestion, internalize it, and respond naturally — so the AI's work had to be done in well under a second to give them that buffer.

We broke the latency budget down:

  • Audio capture and transcription: ~200ms
  • Context assembly and prompt construction: ~50ms
  • LLM inference (streaming first tokens): ~500ms
  • WebSocket delivery to client: ~50ms
  • Render on screen: ~50ms
  • Total target: < 850ms (leaving 150ms buffer)

Every component had to be ruthlessly optimized. There was no room for batch processing, polling, or traditional request-response patterns.
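One way to keep a budget like this honest is to encode it as data and fail fast when a change blows it. A minimal sketch, with illustrative names — not our production config:

```typescript
// Hypothetical encoding of the latency budget above, in milliseconds.
const LATENCY_BUDGET_MS = {
  transcription: 200,
  contextAssembly: 50,
  llmFirstTokens: 500,
  websocketDelivery: 50,
  render: 50,
} as const;

const total = Object.values(LATENCY_BUDGET_MS).reduce((a, b) => a + b, 0);

// Refuse to start if a config change pushes us past the 850ms target.
if (total > 850) {
  throw new Error(`Latency budget exceeded: ${total}ms > 850ms`);
}
```

A guard like this turns the budget from a slide-deck number into something the build can enforce.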

The Architecture

We settled on a WebSocket-first streaming architecture. The pipeline looks like this:

architecture.txt
Browser (Audio Stream)
    ↓ WebSocket
Transcription Service (Whisper)
    ↓ Internal Queue
Context Assembler (Profile + History + Transcript)
    ↓
LLM Inference (Streaming)
    ↓ Token-by-token
WebSocket Hub
    ↓ WebSocket
Browser (Real-time Render)

The critical design decisions:

1. WebSocket over HTTP. HTTP polling and SSE weren't viable. We needed full-duplex communication — the client streams audio up while the server streams suggestions down, simultaneously. WebSockets were the only option that allowed this bidirectional flow without the overhead of repeated connections.

2. Streaming LLM inference. We didn't wait for the full LLM response. We streamed tokens as they were generated, sending partial suggestions to the client in real-time. The first useful fragment typically arrived within 300ms — fast enough that reps saw text appearing almost immediately.

3. Serverful infrastructure. We ran persistent Node.js processes on dedicated instances. Serverless was a non-starter — cold starts alone would have consumed our entire latency budget, and WebSocket connections require long-lived processes that serverless platforms actively work against.

The WebSocket Layer

The key architectural insight was separating the connection layer from the inference layer. The WebSocket server handled connection lifecycle, heartbeats, and reconnection. The inference service handled the AI logic. They communicated through an internal message queue.

ws-server.ts
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

const HEARTBEAT_INTERVAL = 30_000; // a missed pong within one interval marks the socket dead

wss.on("connection", (ws, req) => {
  const clientId = authenticateFromQuery(req);
  if (!clientId) return ws.close(4001, "Unauthorized");

  // isAlive / clientId are ad-hoc extensions of the socket object; in strict
  // TypeScript, declare them on an interface extending WebSocket.
  ws.isAlive = true;
  ws.clientId = clientId;

  ws.on("pong", () => { ws.isAlive = true; });

  ws.on("message", (data) => {
    let msg;
    try {
      msg = JSON.parse(data.toString()); // ws delivers a Buffer, not a string
    } catch {
      return; // drop malformed frames instead of crashing the process
    }
    if (msg.type === "transcript_chunk") {
      inferenceQueue.publish({
        clientId,
        transcript: msg.text,
        timestamp: Date.now(),
      });
    }
  });
});

// Dead connection cleanup
setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) return ws.terminate();
    ws.isAlive = false;
    ws.ping();
  });
}, HEARTBEAT_INTERVAL);

This separation meant we could scale the WebSocket layer (connection-heavy, CPU-light) independently from the inference layer (connection-light, GPU-heavy). In practice, a single WebSocket server handled 200+ concurrent connections while inference ran on separate GPU-backed instances.
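The contract between the two layers is just a typed message. As a sketch of that contract — with an in-process pub/sub standing in for the real broker, and all names illustrative:

```typescript
// The message shape the WebSocket layer publishes and the inference layer consumes.
interface TranscriptJob {
  clientId: string;
  transcript: string;
  timestamp: number;
}

// In-process stand-in for the internal queue; production used a real broker.
class InferenceQueue {
  private handlers: Array<(job: TranscriptJob) => void> = [];

  publish(job: TranscriptJob) {
    this.handlers.forEach((h) => h(job));
  }

  subscribe(handler: (job: TranscriptJob) => void) {
    this.handlers.push(handler);
  }
}
```

Because both sides only agree on `TranscriptJob`, either layer can be redeployed or scaled without the other noticing.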

Streaming LLM Integration

The stream: true flag was the single most important technical decision in the project. Without it, we'd wait 2-3 seconds for the full response. With streaming, the first tokens arrived in ~300ms.

stream-inference.ts
async function streamSuggestion(
  context: ConversationContext,
  clientWs: WebSocket,
  signal?: AbortSignal // lets the throttler cancel superseded requests
) {
  const stream = await openai.chat.completions.create(
    {
      model: "gpt-4",
      stream: true,
      messages: [
        { role: "system", content: buildSystemPrompt(context.profile) },
        ...context.recentTurns,
        { role: "user", content: context.latestTranscript },
      ],
      max_tokens: 150,
      temperature: 0.7,
    },
    { signal } // per-request option supported by the OpenAI Node SDK
  );

  let buffer = "";
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (!delta) continue;

    buffer += delta;
    clientWs.send(JSON.stringify({
      type: "suggestion_delta",
      text: delta,
      fullText: buffer,
    }));
  }

  clientWs.send(JSON.stringify({
    type: "suggestion_complete",
    text: buffer,
  }));
}

We also implemented speculative rendering on the client side — displaying tokens with a subtle typing animation as they arrived. This made the perceived latency feel even shorter than the actual latency, because the rep saw activity immediately rather than staring at a blank space.
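The client-side handling of those delta messages can be sketched like this — the DOM update is stubbed out behind a callback, and the class name is illustrative:

```typescript
// Messages mirroring the suggestion_delta / suggestion_complete frames above.
type SuggestionMsg =
  | { type: "suggestion_delta"; text: string; fullText: string }
  | { type: "suggestion_complete"; text: string };

class SuggestionRenderer {
  private shown = "";

  // render() is whatever updates the screen (and drives the typing animation).
  constructor(private render: (text: string, done: boolean) => void) {}

  handle(msg: SuggestionMsg) {
    if (msg.type === "suggestion_delta") {
      // Trust fullText rather than appending deltas, so a dropped or
      // reordered frame self-heals on the next message.
      this.shown = msg.fullText;
      this.render(this.shown, false);
    } else {
      this.shown = msg.text;
      this.render(this.shown, true);
    }
  }
}
```

Carrying `fullText` on every delta costs a few bytes per frame but makes the client stateless with respect to message ordering.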

Context Window Management

The teleprompter needed conversation context to generate relevant suggestions. But context windows fill up fast in a live conversation. We implemented a sliding window approach:

  • Last 90 seconds of transcription (~500 tokens) — the immediate conversational context
  • Compressed conversation summary (~200 tokens) — generated asynchronously between turns
  • Customer profile and product knowledge (~300 tokens) — static context per session

The compression happened asynchronously. While the rep was speaking, a background process summarized the conversation so far using a smaller, faster model. This kept the context window lean for the next inference call — typically under 1,200 tokens total — which was critical for maintaining speed.
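The assembly step itself is simple once the pieces exist. A sketch under stated assumptions — the 4-characters-per-token estimate is a rough heuristic (real code would use a proper tokenizer), and the function and field names are illustrative:

```typescript
// A transcription segment with its arrival time (epoch milliseconds).
interface Segment { text: string; at: number }

// Crude token estimate: ~4 characters per token.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

function assembleContext(
  segments: Segment[],
  summary: string,  // compressed history, refreshed asynchronously between turns
  profile: string,  // static per-session context (customer + product knowledge)
  now = Date.now(),
) {
  const WINDOW_MS = 90_000; // last 90 seconds of transcription
  const recent = segments.filter((s) => now - s.at <= WINDOW_MS);
  const transcript = recent.map((s) => s.text).join(" ");
  return {
    transcript,
    summary,
    profile,
    tokenEstimate:
      estimateTokens(transcript) + estimateTokens(summary) + estimateTokens(profile),
  };
}
```

The `tokenEstimate` field is what lets the caller verify the assembled context stays under the ~1,200-token ceiling before each inference call.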

Backpressure and Flow Control

In a live conversation, transcription chunks arrive faster than the LLM can process them. Without flow control, the queue grows without bound, and latency degrades from 1 second to 5 seconds to 30 seconds. We hit this exact problem in week two.

We implemented three-level backpressure:

Level 1 — Client-side debouncing. We didn't send every transcription fragment. We waited for natural sentence boundaries detected by punctuation and pause duration before triggering inference. This alone cut unnecessary inference calls by 60%.
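The boundary check itself is small. A minimal sketch — the 700ms pause threshold here is an assumption for illustration, not our production value:

```typescript
// Level 1 debouncing: fire inference only at a likely sentence boundary,
// detected by terminal punctuation or a sufficiently long pause.
function shouldTriggerInference(
  fragment: string,
  msSinceLastSpeech: number,
  pauseThresholdMs = 700, // illustrative threshold
): boolean {
  const endsAtBoundary = /[.!?]\s*$/.test(fragment.trim());
  return endsAtBoundary || msSinceLastSpeech >= pauseThresholdMs;
}
```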

Level 2 — Server-side message coalescing. If a new transcription arrived while the previous inference was still running, we cancelled the in-flight request and started a new one with the combined context. No point generating a suggestion for a sentence that's already been superseded.

Level 3 — Circuit breaker. If the inference queue exceeded 3 pending requests, we stopped accepting new ones and showed a brief "thinking" indicator to the rep. This prevented cascading latency and kept the system honest about its capacity.

throttler.ts
class InferenceThrottler {
  private pending = new Map<string, AbortController>();

  async process(clientId: string, context: ConversationContext, ws: WebSocket) {
    // Cancel any in-flight inference for this client (Level 2 coalescing)
    const existing = this.pending.get(clientId);
    if (existing) existing.abort();

    // Circuit breaker: reject if the queue is saturated (Level 3)
    if (this.pending.size >= 3) {
      return { status: "throttled" };
    }

    const controller = new AbortController();
    this.pending.set(clientId, controller);

    try {
      // streamSuggestion as defined above, extended to accept an AbortSignal
      const result = await streamSuggestion(context, ws, controller.signal);
      return { status: "ok", result };
    } catch (err) {
      // An abort is expected when a newer transcript supersedes this request
      if (controller.signal.aborted) return { status: "superseded" };
      throw err;
    } finally {
      // Only clear our own entry — a newer call may already have replaced it
      if (this.pending.get(clientId) === controller) {
        this.pending.delete(clientId);
      }
    }
  }
}

The Results

After 3 months of iteration — and more than a few late nights debugging WebSocket edge cases — the numbers told the story:

  • 40% increase in clientele — reps equipped with the teleprompter closed more deals and handled conversations with greater confidence
  • 53% reduction in hiring overhead — new reps ramped up dramatically faster with AI-assisted conversations
  • 12% boost in lead conversions — suggestions were contextually accurate enough to measurably improve pitch quality
  • Average latency: 780ms — comfortably under our 850ms target
  • P99 latency: 1.2 seconds — acceptable for edge cases involving longer context windows

Lessons Learned

Start with the latency budget, not the features. We designed the entire architecture around the 1-second constraint. Every feature proposal was evaluated against it. "Does it fit in the budget?" became our team's default question, and it saved us from scope creep that would have killed the product.

Streaming changes everything. The difference between waiting 2.5 seconds for a complete response and seeing the first words appear after 300ms is not just a UX improvement — it fundamentally changes what's possible. Applications that feel impossible with batch inference become natural with streaming.

WebSocket infrastructure is hard. Connection management, reconnection logic, heartbeats, authentication, backpressure — it's a lot of undifferentiated heavy lifting. It works, but it demands respect. Don't underestimate the engineering effort.
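To give a flavor of that heavy lifting: even the reconnect delay deserves thought, or a flapping server gets hammered by every client at once. A sketch of exponential backoff with jitter, with illustrative constants:

```typescript
// Reconnect delay: exponential backoff, capped, with "equal jitter" so
// clients that disconnected together don't all reconnect together.
function reconnectDelayMs(attempt: number, baseMs = 250, capMs = 10_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // uniform in [exp/2, exp)
}
```

On disconnect, the client schedules `setTimeout(connect, reconnectDelayMs(attempt++))` and resets `attempt` to zero once a connection survives its first heartbeat.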

The AI is 30% of the work. The remaining 70% was infrastructure, latency optimization, error handling, and making the system reliable enough that reps could trust it mid-conversation. Anyone can build a demo. Shipping it to production where real people depend on it — that's engineering.

The hardest problems in real-time AI aren't the models. They're the pipes.