Centrifugo for AI apps

Most AI apps start streaming the same way: the backend calls the model and pipes tokens straight to the browser over SSE (or a raw WebSocket). Most tutorials show this approach, and for a single user watching a single generation it works well. If that's your case, stream directly and skip Centrifugo.

The trouble is what comes after the demo. Direct streaming breaks down on exactly the things a production AI app needs — and each of those is a primitive Centrifugo already provides, rather than custom catch-up code you write and maintain.

Where direct streaming breaks

In production you need…	With direct streaming	With Centrifugo
Resume after a reload or a dropped connection	the in-flight response is gone — nothing to replay	clients reattach and replay missed tokens via history & recovery
The same generation in more than one place — the user's other tabs, a human operator, an audit log	a single stream can't fan out	publish once to a channel; every subscriber receives it
Survive proxies and networks that drop SSE	no fallback path	WebSocket, SSE, HTTP-streaming, WebTransport with automatic fallback in the SDKs
Scale past one backend node	a token produced on node A can't reach a client connected to node B	horizontal scaling through a Redis or NATS broker
Predictable cost at token volume	hosted pub/sub services typically bill per message or per connection-minute	self-hosted, so there is no per-message billing
Keep data and keys on your network	prompts and provider API keys traverse a third-party SaaS	runs on your own infrastructure

Every row is the same shape: the toy version doesn't need it, the real product does.
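
The first row rests on history and recovery. The sketch below is only a rough mental model: a toy, in-memory illustration, not Centrifugo's implementation or API. A channel keeps a bounded history of publications numbered by offset, plus an epoch that changes if that history is lost. A reconnecting client reports the last offset and epoch it saw and gets back only what it missed. `ChannelHistory`, `publish`, and `recover` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
import uuid


@dataclass(frozen=True)
class Publication:
    offset: int
    data: str


class ChannelHistory:
    """Toy model of a bounded, offset-numbered channel history (hypothetical)."""

    def __init__(self, size: int):
        self.epoch = uuid.uuid4().hex  # would change if the history were lost or reset
        self.offset = 0  # offset of the latest publication
        self.pubs = deque(maxlen=size)  # oldest entries fall off the front

    def publish(self, data: str) -> int:
        self.offset += 1
        self.pubs.append(Publication(self.offset, data))
        return self.offset

    def recover(self, since_offset: int, epoch: str):
        """Return (missed publications, fully_recovered) for a reconnecting client."""
        if epoch != self.epoch:
            # The client's position refers to a history that no longer exists.
            return [], False
        missed = [p for p in self.pubs if p.offset > since_offset]
        # Recovery is complete only if the history still holds the publication
        # immediately after the client's last seen offset.
        oldest = self.pubs[0].offset if self.pubs else self.offset + 1
        return missed, oldest <= since_offset + 1
```

When the second value comes back false, some tokens have already fallen out of the bounded history, so the client should reload the current state from your backend instead of relying on replay. In Centrifugo, history size and lifetime are configured per channel namespace on the server.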

How it fits

Centrifugo sits between your inference backend and your clients. Your backend stays in control of the model call and publishes tokens (or larger chunks) to a channel through the server API; Centrifugo delivers them to every connected subscriber over whatever transport each client uses. It's language- and framework-agnostic — it doesn't care whether your inference pipeline runs Python, Go, or anything else.
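
A minimal sketch of the backend side, assuming Centrifugo v5 or later with its HTTP server API on `localhost:8000` (publish endpoint `/api/publish`, API key in the `X-API-Key` header). The channel name `gen_42`, the payload shape with `type`/`seq`/`text`, the 32-character chunk size, and all helper names are hypothetical. Grouping tokens into chunks trades a little latency for fewer messages, and the `seq` field lets a client spot gaps or duplicates in its own rendering.

```python
import json
import urllib.request

# Assumptions: Centrifugo v5+ HTTP server API on localhost:8000 and an API key
# configured on the server. Replace both for your deployment.
CENTRIFUGO_API = "http://localhost:8000/api"
API_KEY = "change-me"  # hypothetical placeholder


def build_publish_request(channel: str, data: dict) -> urllib.request.Request:
    """Build (but do not send) a server API publish request."""
    body = json.dumps({"channel": channel, "data": data}).encode("utf-8")
    return urllib.request.Request(
        f"{CENTRIFUGO_API}/publish",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )


def chunk_tokens(tokens, max_chars=32):
    """Group model tokens into larger chunks to reduce per-message overhead."""
    buf, size = [], 0
    for tok in tokens:
        buf.append(tok)
        size += len(tok)
        if size >= max_chars:
            yield "".join(buf)
            buf, size = [], 0
    if buf:  # flush whatever is left when the generation ends
        yield "".join(buf)


def _send(req: urllib.request.Request) -> None:
    """Perform the POST and discard the response body."""
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def stream_to_channel(channel, tokens, send=_send):
    """Publish a generation to `channel` chunk by chunk, then a done marker."""
    for seq, text in enumerate(chunk_tokens(tokens)):
        send(build_publish_request(channel, {"type": "chunk", "seq": seq, "text": text}))
    # Hypothetical end-of-generation marker so clients can stop a spinner.
    send(build_publish_request(channel, {"type": "done"}))
```

Injecting `send` keeps the publishing logic testable without a running server; in production you would keep the default, which performs the POST.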

For high-throughput streams, the binary Protobuf protocol keeps per-token overhead low. Online presence shows which clients are currently subscribed to a session's channel. And when an assistant message is persisted, the PostgreSQL stream broker can publish the realtime event inside the same database transaction as the row write, so the stored message and the delivered event never diverge.

Examples

Streaming AI responses with Centrifugo — end-to-end tutorial streaming GPT responses with FastAPI and temporary channels.
Scaling AI token streams with Centrifugo — reconnect recovery, multi-tab synchronization, transport fallbacks, and horizontal scaling with Redis. Source on GitHub.
Transactional publishing with the PostgreSQL stream broker — committing the stored message and the published event in one transaction.

When you don't need Centrifugo

For a single user watching a single generation — no reconnect recovery, no observers, one backend node — streaming straight from the model provider to the browser over SSE is simpler, and you should use that. Centrifugo earns its place once you need resumable streams, fan-out to multiple viewers, transport flexibility, or scale beyond one node.
