Centrifugo for AI apps

Most AI apps start streaming the same way: the backend calls the model and pipes tokens straight to the browser over SSE (or a raw WebSocket). It's the approach in every tutorial, and for a single user watching a single generation it works well — if that's your case, stream directly and skip Centrifugo.

The trouble is what comes after the demo. Direct streaming breaks down on exactly the things a production AI app needs — and each of those is a primitive Centrifugo already provides, rather than custom catch-up code you write and maintain.

Where direct streaming breaks

| In production you need… | With direct streaming | With Centrifugo |
| --- | --- | --- |
| Resume after a reload or a dropped connection | the in-flight response is gone; nothing to replay | clients reattach and replay missed tokens via history & recovery |
| The same generation in more than one place (the user's other tabs, a human operator, an audit log) | a single stream can't fan out | publish once to a channel; every subscriber receives it |
| Survive proxies and networks that drop SSE | no fallback path | WebSocket, SSE, HTTP-streaming, WebTransport with automatic fallback in the SDKs |
| Scale past one backend node | a token produced on node A can't reach a client connected to node B | horizontal scaling through a Redis or NATS broker |
| Sane cost at token volume | hosted pub/sub bills per message or connection-minute | self-hosted; no per-message billing |
| Keep data and keys on your network | prompts and provider API keys traverse a third-party SaaS | runs on your own infrastructure |

Every row is the same shape: the toy version doesn't need it, the real product does.
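
To make the first row concrete, here is a minimal sketch of the idea behind replaying missed tokens. It is a toy in-memory model, not Centrifugo's implementation or API: the `ChannelHistory` class and its methods are hypothetical names invented for illustration. The point it shows is that if every published token gets a position in a stream, a reconnecting client only has to say where it stopped.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelHistory:
    """Toy stand-in for a channel that keeps history (hypothetical, not Centrifugo's API)."""

    tokens: list[str] = field(default_factory=list)

    def publish(self, token: str) -> int:
        """Append a token and return its offset (1-based position in the stream)."""
        self.tokens.append(token)
        return len(self.tokens)

    def replay_since(self, last_seen_offset: int) -> list[str]:
        """Return every token published after the offset the client last saw."""
        return self.tokens[last_seen_offset:]


history = ChannelHistory()
for tok in ["Hello", ",", " world", "!"]:
    history.publish(tok)

# The client received two tokens, then the connection dropped.
seen = history.tokens[:2]
last_offset = len(seen)

# On reconnect it asks only for what it missed and stitches the text back together.
missed = history.replay_since(last_offset)
recovered_text = "".join(seen + missed)
```

With direct streaming there is no equivalent of `replay_since`: the tokens existed only on the connection that was lost.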

How it fits

Centrifugo sits between your inference backend and your clients. Your backend stays in control of the model call and publishes tokens (or larger chunks) to a channel through the server API; Centrifugo delivers them to every connected subscriber over whatever transport each client uses. It's language- and framework-agnostic — it doesn't care whether your inference pipeline runs Python, Go, or anything else.
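
As a rough sketch of the backend side, the snippet below builds and sends a publish call using only the Python standard library. It assumes the server HTTP API of recent Centrifugo versions, where publishing is a `POST` to `/api/publish` authenticated with an `X-API-Key` header; check the server API reference for the version you run. The URL, the API key, the channel name, and the `seq`/`token` payload shape are placeholders chosen for illustration.

```python
import json
import urllib.request

# Assumed values for illustration; adjust to your deployment.
CENTRIFUGO_API_URL = "http://localhost:8000/api/publish"
CENTRIFUGO_API_KEY = "my-api-key"  # hypothetical placeholder


def build_publish_request(channel: str, data: dict) -> urllib.request.Request:
    """Build (but do not send) a publish request for one channel."""
    body = json.dumps({"channel": channel, "data": data}).encode("utf-8")
    return urllib.request.Request(
        CENTRIFUGO_API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": CENTRIFUGO_API_KEY,
        },
        method="POST",
    )


def publish_token(channel: str, token: str, seq: int) -> dict:
    """Send one token to every subscriber of the channel and return the API reply."""
    req = build_publish_request(channel, {"seq": seq, "token": token})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

In the inference loop you would call something like `publish_token("session-42", token, seq)` for each token, or batch several tokens into one call to cut the number of requests. A production backend would more likely use a pooled HTTP client or the GRPC API than open a connection per token.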

For high-throughput streams, the binary Protobuf protocol keeps per-token overhead low. Online presence tracks who is attached to a session. And when an assistant message is persisted, the PostgreSQL stream broker can publish the realtime event inside the same database transaction as the row write — so the stored message and the delivered event never diverge.
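
The guarantee in that last sentence comes from writing the message and the event in one transaction. The sketch below illustrates the general pattern only: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `messages` and `events` tables are hypothetical, not the schema Centrifugo uses. It shows that a failure between the two writes leaves neither behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables for illustration; not Centrifugo's actual schema.
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, session TEXT, body TEXT)")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, channel TEXT, payload TEXT)")
conn.commit()


def save_and_publish(conn, session, body, fail=False):
    """Store the assistant message and its realtime event atomically."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("INSERT INTO messages (session, body) VALUES (?, ?)", (session, body))
        if fail:
            raise RuntimeError("simulated failure before the event is written")
        conn.execute("INSERT INTO events (channel, payload) VALUES (?, ?)", (session, body))


def counts(conn):
    """Return (number of stored messages, number of queued events)."""
    m = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    e = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return m, e


save_and_publish(conn, "session-42", "Hello, world!")
after_success = counts(conn)

try:
    save_and_publish(conn, "session-42", "lost", fail=True)
except RuntimeError:
    pass
after_failure = counts(conn)
```

Both counts stay equal after the failed call, which is the property that matters: there is never a stored message without its event, or an event without its stored message.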

When you don't need Centrifugo

For a single user watching a single generation — no reconnect recovery, no observers, one backend node — streaming straight from the model provider to the browser over SSE is simpler, and you should use that. Centrifugo earns its place once you need resumable streams, fan-out to multiple viewers, transport flexibility, or scale beyond one node.