Improving Centrifugo Redis Engine throughput and allocation efficiency with Rueidis Go library
The main objective of Centrifugo is to manage persistent client connections established over various real-time transports (including WebSocket, HTTP-Streaming, SSE, WebTransport, etc – see here) and offer an API for publishing data to established connections. Clients subscribe to channels, so Centrifugo implements PUB/SUB mechanics to transmit published data to all online channel subscribers.
Centrifugo employs Redis as its primary scalability option – so that it's possible to distribute client connections amongst numerous Centrifugo nodes without worrying about subscribers of the same channel being connected to separate nodes. Redis is an incredibly mature, simple, and fast in-memory storage. Due to its various built-in data structures and PUB/SUB support, Redis is a perfect fit to be both Centrifugo Broker and PresenceManager (we will describe what these are shortly).
In Centrifugo v4.1.0 we introduced an updated implementation of our Redis Engine (Engine in Centrifugo == Broker + PresenceManager) which provides significant performance improvements to our users. This post discusses the factors that prompted us to update the Redis Engine implementation and provides some insight into the results we managed to achieve. We'll examine a few well-known Go libraries for Redis communication and contrast them against Centrifugo tasks.
Broker and PresenceManager
Before we get started, let's define what Centrifugo's Broker and PresenceManager terms mean.
Broker is an interface responsible for maintaining subscriptions from different Centrifugo nodes (initiated by client connections). That helps to scale client connections over many Centrifugo instances without worrying about subscribers of the same channel being connected to different nodes – since all Centrifugo nodes are connected via PUB/SUB. Messages published to one node are delivered to channel subscribers connected to other nodes.
Another major part of Broker is keeping an expiring publication history for channels (streams), so that Centrifugo may provide a fast cache for messages missed by clients that went offline for a short period, and compensate the at most once delivery of Redis PUB/SUB using incremental Publication offsets. Centrifugo uses STREAM and HASH data structures in Redis to store channel history and stream meta information.
In general, Centrifugo architecture may be perfectly illustrated by this picture (Gophers are Centrifugo nodes all connected to the Broker, and sockets are WebSockets):
PresenceManager is an interface responsible for managing online presence information – the list of currently active channel subscribers. While a connection is alive, we periodically update presence entries for the channels the connection is subscribed to (for channels where presence is enabled). Presence data should expire if not updated by a client connection for some time. Centrifugo uses two Redis data structures for managing presence in channels – HASH and ZSET.
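For illustration only, here is a rough sketch of how presence can be modeled with these two structures (shown with the redigo API described in the next section; the key names, the 60 second expiration and the payload handling are assumptions for this example, not the exact scheme Centrifugo uses):

func presenceSketch(conn redigo.Conn, channel, clientID string, payload []byte) (map[string]string, error) {
    now := time.Now().Unix()
    expireAt := now + 60 // entry considered stale if not refreshed within 60 seconds
    conn.Send("MULTI")
    // ZSET keeps expiration timestamps per client ID.
    conn.Send("ZADD", "presence.ts."+channel, expireAt, clientID)
    // HASH keeps the presence payload per client ID.
    conn.Send("HSET", "presence.data."+channel, clientID, payload)
    // Drop entries which were not refreshed in time before reading presence
    // (a real implementation would also remove expired members from the HASH –
    // omitted here for brevity).
    conn.Send("ZREMRANGEBYSCORE", "presence.ts."+channel, "-inf", now)
    conn.Send("HGETALL", "presence.data."+channel)
    replies, err := redigo.Values(conn.Do("EXEC"))
    if err != nil {
        return nil, err
    }
    // HGETALL reply is the last one in the MULTI/EXEC block.
    return redigo.StringMap(replies[len(replies)-1], nil)
}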
Redigo
For a long time, the gomodule/redigo package served as the foundation for the Redis Engine implementation in Centrifugo. Huge props go to Mr Gary Burd for creating it.
Redigo offers a connection Pool to Redis. Simple usage involves getting a connection from the pool, issuing a request to Redis over that connection, and then putting the connection back to the pool after receiving the result from Redis.
Let's write a benchmark which demonstrates this basic usage of Redigo and measures SET operation performance:
func BenchmarkRedigo(b *testing.B) {
    pool := redigo.Pool{
        MaxIdle:   128,
        MaxActive: 128,
        Wait:      true,
        Dial: func() (redigo.Conn, error) {
            return redigo.Dial("tcp", ":6379")
        },
    }
    defer pool.Close()
    b.ResetTimer()
    b.SetParallelism(128)
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // Get a connection from the pool, issue the command, return the
            // connection back to the pool.
            c := pool.Get()
            _, err := c.Do("SET", "redigo", "test")
            if err != nil {
                b.Fatal(err)
            }
            c.Close()
        }
    })
}
Let's run it:
BenchmarkRedigo-8 228804 4648 ns/op 62 B/op 2 allocs/op
Seems pretty fast, but we can improve it further.
Redigo with pipelining
To increase throughput in Centrifugo, instead of using Redigo's Pool for each operation, we acquired a dedicated connection from the Pool and utilized Redis pipelining to send multiple commands where possible.
Redis pipelining improves performance by executing multiple commands within a single client-server-client round trip. Instead of executing many commands one by one, you can queue the commands in a pipeline and then execute the queued commands as if they were a single command. Redis processes the commands in order and sends an individual response for each of them. Given the single-CPU nature of Redis, reducing the number of active connections when using pipelining also has a positive impact on throughput – so pipelining is beneficial from this angle as well.
You can quickly estimate the benefits of pipelining by running Redis locally and running the redis-benchmark tool which comes with the Redis distribution over it:
> redis-benchmark -n 100000 set key value
Summary:
throughput summary: 84674.01 requests per second
And with pipelining:
> redis-benchmark -n 100000 -P 64 set key value
Summary:
throughput summary: 666880.00 requests per second
In Centrifugo we are using a smart batching technique for collecting pipelines (also described in one of the previous posts in this blog).
To demonstrate the benefits of pipelining, let's look at the following benchmark:
const (
    maxCommandsInPipeline = 512
    numPipelineWorkers    = 1
)

type command struct {
    errCh chan error
}

type sender struct {
    cmdCh chan command
    pool  *redigo.Pool
}

func newSender(pool *redigo.Pool) *sender {
    s := &sender{
        cmdCh: make(chan command),
        pool:  pool,
    }
    // Spawn workers, each works over its own dedicated connection.
    for i := 0; i < numPipelineWorkers; i++ {
        go s.runPipelineRoutine()
    }
    return s
}

func (s *sender) send() error {
    errCh := make(chan error, 1)
    cmd := command{
        errCh: errCh,
    }
    // Submit command to be executed by runPipelineRoutine.
    s.cmdCh <- cmd
    return <-errCh
}

func (s *sender) runPipelineRoutine() {
    conn := s.pool.Get()
    defer conn.Close()
    for {
        select {
        case cmd := <-s.cmdCh:
            commands := []command{cmd}
            conn.Send("set", "redigo", "test")
            // Collect a batch of commands to send to Redis in one RTT.
        loop:
            for i := 0; i < maxCommandsInPipeline; i++ {
                select {
                case cmd := <-s.cmdCh:
                    commands = append(commands, cmd)
                    conn.Send("set", "redigo", "test")
                default:
                    break loop
                }
            }
            // Flush all collected commands to the network.
            err := conn.Flush()
            if err != nil {
                for i := 0; i < len(commands); i++ {
                    commands[i].errCh <- err
                }
                continue
            }
            // Read responses to commands, they come in order.
            for i := 0; i < len(commands); i++ {
                _, err := conn.Receive()
                commands[i].errCh <- err
            }
        }
    }
}
func BenchmarkRedigoPipelining(b *testing.B) {
    pool := redigo.Pool{
        Wait: true,
        Dial: func() (redigo.Conn, error) {
            return redigo.Dial("tcp", ":6379")
        },
    }
    defer pool.Close()
    sender := newSender(&pool)
    b.ResetTimer()
    b.SetParallelism(128)
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            err := sender.send()
            if err != nil {
                b.Fatal(err)
            }
        }
    })
}
This is the strategy we employed in Centrifugo for a long time. As you can see, the code with automatic pipelining gets more complex, and in real life it's even more complicated – it has to support different types of commands, channel send timeouts, and server shutdown.
What about the performance of this approach?
BenchmarkRedigo-8 228804 4648 ns/op 62 B/op 2 allocs/op
BenchmarkRedigoPipelining-8 1840758 604.7 ns/op 176 B/op 4 allocs/op
Operation latency dropped from 4648 ns/op to 604.7 ns/op – not bad, right?
It's worth mentioning that with increased RTT between the application and Redis the pipelining approach will provide worse throughput than on localhost. But it can still be better than the pool-based approach. Let's say we have a 5ms latency between the app and Redis. With a pool size of 128 you will be able to issue up to 128 * (1000 / 5) = 25600 requests per second over 128 connections. With the pipelining approach above, the theoretical limit is 512 * (1000 / 5) = 102400 requests per second over a single connection (though with the pipelining code shown above we need larger parallelism, say 512 instead of 128). And it can scale further if you increase numPipelineWorkers to work over several connections in parallel. Though increasing numPipelineWorkers has a negative effect on CPU – we will discuss this later in this post.
Redigo is an awesome battle-tested library that served us well for a long time.
Motivation to migrate
There are three modes in which Centrifugo can work with Redis these days:
- Connecting to a standalone single Redis instance
- Connecting to Redis in master-replica configuration, where Redis Sentinel controls the failover process
- Connecting to Redis Cluster
All modes additionally can be used with client-side consistent sharding. So it's possible to scale Redis even without a Redis Cluster setup.
Unfortunately, with the pure Redigo library it's only possible to implement [1] – i.e. connecting to a single standalone Redis instance.
To support the scheme with Sentinel you either need to have a proxy between the application and Redis which proxies the connection to the current Redis master – for example, with Haproxy it's possible this way:
listen redis
    server redis-01 127.0.0.1:6380 check port 6380 check inter 2s weight 1 inter 2s downinter 5s rise 10 fall 2 on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions
    server redis-02 127.0.0.1:6381 check port 6381 check inter 2s weight 1 inter 2s downinter 5s rise 10 fall 2 backup
    bind *:6379
    mode tcp
    option tcpka
    option tcplog
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    balance roundrobin
Or you need to additionally import the FZambia/sentinel library, which provides a communication layer with Redis Sentinel on top of Redigo's connection Pool.
For communicating with Redis Cluster one more library may be used – mna/redisc, which is also a layer on top of redigo basic functionality.
Combining redigo + FZambia/sentinel + mna/redisc we managed to implement all three connection modes. This worked, though it resulted in a rather tricky Redis setup. Also, it was difficult to re-use the existing pipelining code we had for standalone Redis with Redis Cluster. As a result, Centrifugo only used pipelining in the standalone and Sentinel Redis cases. When using Redis Cluster, however, Centrifugo merely used the connection pool to issue requests, thus not benefiting from request pipelining. Due to this we had some code duplication to send the same requests in various Redis configurations.
Another thing is that Redigo uses interface{} for command construction. To send a command to Redis, Redigo has a Do method which accepts the name of the command and variadic interface{} arguments to construct the command:

Do(commandName string, args ...interface{}) (reply interface{}, err error)

While this works well and you can issue any command to Redis, you need to be very accurate when constructing a command. This also adds some allocation overhead. As we know, more memory allocations lead to increased CPU utilization because the allocation process itself requires more processing power and the GC is under more strain.
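For example (a hypothetical snippet just to illustrate the point), the compiler cannot catch a malformed command here – the mistake only surfaces as an error reply at runtime:

// "EX" is passed without the expiration seconds value – this compiles fine,
// but Redis answers with "ERR syntax error" at runtime.
_, err := c.Do("SET", "key", "value", "EX")
if err != nil {
    // handle the runtime error
}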
At some point we felt that eliminating additional dependencies (even though I am the author of one of them) and reducing allocations in the Redis communication layer would be a nice step forward for Centrifugo. So we started looking around for redigo alternatives.
To summarize, here is what we wanted from Redis library:
- Possibility to work with all three Redis setup options we support: standalone, master-replica(s) with Sentinel, Redis Cluster, so we can depend on one library instead of three
- Fewer memory allocations (and a more type-safe API is a plus)
- Support for working with RESP2-only Redis servers, as we need that for backwards compatibility (some vendors, like Redis Enterprise, still support the RESP2 protocol only)
- The library should be actively maintained
Go-redis/redis
The most obvious alternative to Redigo is the go-redis/redis package. It's popular, regularly gets updates, and is used by a huge number of Go projects (Grafana, Thanos, etc.). It's maintained by Vladimir Mihailenco, who created several more awesome Go libraries, like msgpack for example. I personally have successfully used go-redis/redis in several other projects I worked on.
To avoid setup boilerplate for various Redis installation variations, go-redis/redis has UniversalClient. From the docs:

UniversalClient is a wrapper client which, based on the provided options, represents either a ClusterClient, a FailoverClient, or a single-node Client. This can be useful for testing cluster-specific applications locally or having different clients in different environments.

In terms of implementation, go-redis/redis also has an internal pool of connections to Redis, similar to redigo. It's also possible to use the Client.Pipeline method to allocate a Pipeliner interface and use it for pipelining. So UniversalClient reduces both the setup boilerplate for different Redis installation types and the number of dependencies we had, and it provides a very similar way to pipeline requests, so we could easily re-implement the things we had with Redigo.
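A rough sketch of how a single constructor covers the different setups (addresses and the master name here are placeholders): according to the go-redis docs, providing MasterName selects a FailoverClient and providing several addresses selects a ClusterClient.

// Single-node Client.
standalone := redis.NewUniversalClient(&redis.UniversalOptions{
    Addrs: []string{"127.0.0.1:6379"},
})

// Sentinel-backed FailoverClient – selected because MasterName is set.
failover := redis.NewUniversalClient(&redis.UniversalOptions{
    Addrs:      []string{"127.0.0.1:26379"}, // Sentinel addresses
    MasterName: "mymaster",
})

// ClusterClient – selected because several addresses are provided.
cluster := redis.NewUniversalClient(&redis.UniversalOptions{
    Addrs: []string{"127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002"},
})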
Go-redis also provides more type-safety when constructing commands compared to Redigo: almost every Redis command is implemented as a separate method of Client. For example, Publish is defined as:

func (c Client) Publish(ctx context.Context, channel string, message interface{}) *IntCmd

You can see though that we still have interface{} here for the message argument type. I suppose this was done for convenience – to pass both string and []byte. But it still produces some extra allocations.
Without pipelining, the simplest benchmark with go-redis/redis may look like this:
func BenchmarkGoredis(b *testing.B) {
    client := redis.NewUniversalClient(&redis.UniversalOptions{
        Addrs:    []string{":6379"},
        PoolSize: 128,
    })
    defer client.Close()
    b.ResetTimer()
    b.SetParallelism(128)
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            resp := client.Set(context.Background(), "goredis", "test", 0)
            if resp.Err() != nil {
                b.Fatal(resp.Err())
            }
        }
    })
}
Let's run it:
BenchmarkRedigo-8 228804 4648 ns/op 62 B/op 2 allocs/op
BenchmarkGoredis-8 268444 4561 ns/op 244 B/op 8 allocs/op
The result is pretty comparable to Redigo, though go-redis allocates more (by the way, most of the allocations come from the connection liveness check performed upon getting a connection from the pool, which cannot be turned off).
It's interesting – if we dive deeper into what that check is, we discover that this is the only way in Go to check that a connection was closed by the other side without reading data from it. The approach was originally introduced by go-sql-driver/mysql, it's not cross-platform, and a related issue may be found in the Go issue tracker.
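The idea, roughly (a simplified unix-only sketch of the technique, not the exact go-redis code): grab the raw file descriptor and do a non-blocking peek – if the peer has closed the connection the read returns EOF, and if the connection is alive there is simply nothing to read yet.

// connCheck reports an error if conn was closed by the other side.
func connCheck(conn net.Conn) error {
    sysConn, ok := conn.(syscall.Conn)
    if !ok {
        return nil
    }
    rawConn, err := sysConn.SyscallConn()
    if err != nil {
        return err
    }
    var checkErr error
    err = rawConn.Read(func(fd uintptr) bool {
        var buf [1]byte
        n, _, err := syscall.Recvfrom(int(fd), buf[:], syscall.MSG_PEEK|syscall.MSG_DONTWAIT)
        switch {
        case n == 0 && err == nil:
            checkErr = io.EOF // peer closed the connection
        case n > 0 && err == nil:
            checkErr = errors.New("unexpected read from socket")
        case err == syscall.EAGAIN || err == syscall.EWOULDBLOCK:
            checkErr = nil // nothing to read – connection looks alive
        default:
            checkErr = err
        }
        return true // don't wait for the fd to become readable
    })
    if err != nil {
        return err
    }
    return checkErr
}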
But as I said, in Centrifugo we already used pipelining over a dedicated connection for all operations, so we avoid frequently getting connections from the pool. And early experiments proved that go-redis may provide some performance benefits for our use case.
At some point @j178 sent a pull request to the Centrifuge library with Broker and PresenceManager implementations based on go-redis/redis. The amount of code to cover all the various Redis setups was reduced, and we got only one dependency instead of three 🔥
But what about performance? Here we will show results for several operations which are typical for Centrifugo:
- Publish a message to a channel without saving it to the history – this is just a Redis PUBLISH command going through the Redis PUB/SUB system (RedisPublish)
- Publish a message to a channel with saving it to the history – this involves executing a Lua script on the Redis side where we add a publication to the STREAM data structure, update the meta information HASH, and finally PUBLISH to PUB/SUB (RedisPublish_History)
- Subscribe to a channel – that's a SUBSCRIBE Redis command; it's important for it to be fast as Centrifugo should be able to re-subscribe to all the channels in the system upon a mass client reconnect scenario (RedisSubscribe)
- Recover missed publication state from the channel STREAM – this again may be called lots of times when all clients reconnect at once (RedisRecover)
- Update connection presence information – many connections may periodically update their channel online presence information in Redis (RedisAddPresence)
Here are the benchmark results we got when comparing the redigo (v1.8.9) implementation (old) and the go-redis/redis (v9.0.0-rc.2) implementation (new) with Redis v6.2.7 on a Mac with an M1 processor and benchmark parallelism 128:
❯ benchstat redigo_p128.txt goredis_p128.txt
name old time/op new time/op delta
RedisPublish-8 1.45µs ±10% 1.88µs ± 4% +29.32% (p=0.000 n=10+10)
RedisPublish_History-8 12.5µs ± 6% 9.7µs ± 3% -22.77% (p=0.000 n=10+10)
RedisSubscribe-8 1.47µs ±24% 1.47µs ±10% ~ (p=0.469 n=10+10)
RedisRecover-8 18.4µs ± 2% 6.3µs ± 0% -65.78% (p=0.000 n=10+8)
RedisAddPresence-8 3.72µs ± 1% 3.40µs ± 1% -8.74% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RedisPublish-8 483B ± 0% 499B ± 0% +3.37% (p=0.000 n=9+10)
RedisPublish_History-8 1.30kB ± 0% 1.08kB ± 0% -16.67% (p=0.000 n=10+10)
RedisSubscribe-8 892B ± 2% 662B ± 6% -25.83% (p=0.000 n=10+10)
RedisRecover-8 1.25kB ± 1% 1.00kB ± 0% -19.91% (p=0.000 n=10+10)
RedisAddPresence-8 907B ± 0% 827B ± 0% -8.82% (p=0.002 n=7+8)
name old allocs/op new allocs/op delta
RedisPublish-8 10.0 ± 0% 9.0 ± 0% -10.00% (p=0.000 n=10+10)
RedisPublish_History-8 29.0 ± 0% 25.0 ± 0% -13.79% (p=0.000 n=10+10)
RedisSubscribe-8 22.0 ± 0% 14.0 ± 0% -36.36% (p=0.000 n=8+7)
RedisRecover-8 29.0 ± 0% 23.0 ± 0% -20.69% (p=0.000 n=10+10)
RedisAddPresence-8 18.0 ± 0% 17.0 ± 0% -5.56% (p=0.000 n=10+10)
Please note that this benchmark is not a pure performance comparison of two Go libraries for Redis – it's a performance comparison of Centrifugo Engine methods upon switching to a new library.
Or visualized in Grafana:
Centrifugo benchmark results shown in the post use parallelism 128. If someone is interested in checking the numbers for parallelism 1 or 16 – check out this comment on GitHub.
We observe a noticeable reduction in allocations in these benchmarks, and in most benchmarks (both presented here and others not listed in this post) we observed reduced latency.
Overall, the results convinced us that the migration from redigo to go-redis/redis may provide Centrifugo with everything we aimed for – all the goals for a redigo alternative outlined above were successfully fulfilled.
One good thing go-redis/redis allowed us to do is to use Redis pipelining in the Redis Cluster case as well. It's possible due to the fact that go-redis/redis re-maps pipeline objects internally based on keys to execute the pipeline on the correct node of the Redis Cluster. Actually, we could do the same based on redigo + mna/redisc, but here we got it for free.
BTW, there is a page with a comparison between redigo and go-redis/redis in the go-redis/redis docs which outlines some of the things I mentioned here and some others.
But we have not migrated to go-redis/redis in the end. And the reason is another library – rueidis.
Rueidis
While the results with go-redis/redis were good, we also made an attempt to implement the Redis Engine on top of the rueian/rueidis library written by @rueian. According to the docs, rueidis is:

A fast Golang Redis client that supports Client Side Caching, Auto Pipelining, Generics OM, RedisJSON, RedisBloom, RediSearch, RedisAI, RedisGears, etc.

The readme of rueidis contains benchmark results where it hugely outperforms go-redis/redis in terms of operation latency/throughput in both single Redis and Redis Cluster setups:

rueidis works with standalone Redis, Sentinel Redis and Redis Cluster out of the box, just like the UniversalClient of go-redis/redis. So it also allowed us to reduce code boilerplate to work with all these setups.
Again, let's try to write a simple program like we had for Redigo and Go-redis above:
func BenchmarkRueidis(b *testing.B) {
    client, err := rueidis.NewClient(rueidis.ClientOption{
        InitAddress: []string{":6379"},
    })
    if err != nil {
        b.Fatal(err)
    }
    defer client.Close()
    b.ResetTimer()
    b.SetParallelism(128)
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            cmd := client.B().Set().Key("rueidis").Value("test").Build()
            res := client.Do(context.Background(), cmd)
            if res.Error() != nil {
                b.Fatal(res.Error())
            }
        }
    })
}
And run it:
BenchmarkRedigo-8 228804 4648 ns/op 62 B/op 2 allocs/op
BenchmarkGoredis-8 268444 4561 ns/op 244 B/op 8 allocs/op
BenchmarkRueidis-8 2908591 418.5 ns/op 4 B/op 1 allocs/op
The rueidis library comes with automatic implicit pipelining, so you can send each request in an isolated way while rueidis makes sure the request becomes part of a pipeline sent to Redis – thus utilizing the connection between the application and Redis most efficiently, with maximized throughput. The idea of implicit pipelining with Redis is not new, and the Go ecosystem already had the joomcode/redispipe library which implemented it (though it comes with some limitations which made it unsuitable for the Centrifugo use case).
So applications that use a pool-based approach for communication with Redis may observe dramatic improvements in latency and throughput when switching to the Rueidis library.
For Centrifugo we didn't expect such a huge speed-up as shown in the above graphs since we already used pipelining in our Redis Engine. But rueidis implements some ideas which allow it to be very efficient. Insights into these ideas are provided by the Rueidis author in the "Writing a High-Performance Golang Client Library" series of posts on Medium:
- Part 1: Batching on Pipeline
- Part 2: Reading Again From Channels?
- Part 3: Remove the Bad Busy Loops With the Sync.Cond
I did some prototypes with rueidis which were super-promising in terms of performance. There were some issues found during that early prototyping (mostly with PUB/SUB) – but all of them were quickly resolved by Rueian.
Until the v0.0.80 release rueidis did not support RESP2 though, so we could not replace our Redis Engine implementation with it. But as soon as it got RESP2 support we opened a pull request with the alternative implementation.
Since auto-pipelining is used in rueidis by default, we were able to remove some of our own pipelining management code – so the Engine implementation is more concise now. One more thing to mention is the simpler PUB/SUB code we were able to write with rueidis. One example is that in the redigo case we had to periodically PING the PUB/SUB connection to keep it alive; rueidis does this automatically.
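For example, a dedicated PUB/SUB connection with rueidis may look roughly like this (the channel name and message handling are illustrative):

// Receive blocks while the subscription is alive and calls the callback for
// every incoming message; periodic PINGs are handled by rueidis itself.
err := client.Receive(context.Background(), client.B().Subscribe().Channel("channel").Build(),
    func(msg rueidis.PubSubMessage) {
        log.Printf("message from %s: %s", msg.Channel, msg.Message)
    })
if err != nil {
    // connection was closed or the context was canceled
}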
Regarding performance, here are the benchmark results we got when comparing the redigo (v1.8.9) implementation (old) and the rueidis (v0.0.90) implementation (new):
❯ benchstat redigo_p128.txt rueidis_p128.txt
name old time/op new time/op delta
RedisPublish-8 1.45µs ±10% 0.56µs ± 1% -61.53% (p=0.000 n=10+9)
RedisPublish_History-8 12.5µs ± 6% 9.7µs ± 1% -22.43% (p=0.000 n=10+9)
RedisSubscribe-8 1.47µs ±24% 1.45µs ± 1% ~ (p=0.484 n=10+9)
RedisRecover-8 18.4µs ± 2% 6.2µs ± 1% -66.08% (p=0.000 n=10+10)
RedisAddPresence-8 3.72µs ± 1% 3.60µs ± 1% -3.34% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RedisPublish-8 483B ± 0% 91B ± 0% -81.16% (p=0.000 n=9+10)
RedisPublish_History-8 1.30kB ± 0% 0.39kB ± 0% -70.08% (p=0.000 n=10+8)
RedisSubscribe-8 892B ± 2% 360B ± 0% -59.66% (p=0.000 n=10+10)
RedisRecover-8 1.25kB ± 1% 0.36kB ± 1% -71.52% (p=0.000 n=10+10)
RedisAddPresence-8 907B ± 0% 151B ± 1% -83.34% (p=0.000 n=7+9)
name old allocs/op new allocs/op delta
RedisPublish-8 10.0 ± 0% 2.0 ± 0% -80.00% (p=0.000 n=10+10)
RedisPublish_History-8 29.0 ± 0% 10.0 ± 0% -65.52% (p=0.000 n=10+10)
RedisSubscribe-8 22.0 ± 0% 6.0 ± 0% -72.73% (p=0.002 n=8+10)
RedisRecover-8 29.0 ± 0% 7.0 ± 0% -75.86% (p=0.000 n=10+10)
RedisAddPresence-8 18.0 ± 0% 3.0 ± 0% -83.33% (p=0.000 n=10+10)
Or visualized in Grafana:
2.5x more publication throughput than we had before! Instead of 700k publications/sec we went towards 1.7 million publications/sec due to drastically decreased publish operation latency (1.45µs -> 0.56µs). This means that our previous Engine implementation under-utilized Redis, and Rueidis just pushes us towards the Redis limits. The latency of most other operations is also reduced.
The allocation efficiency of the rueidis-based implementation is the best: as you can see, rueidis helped us to produce significantly fewer memory allocations for all our Redis operations. Allocation improvements directly affect Centrifugo node CPU usage, though we will talk about CPU more later below.
For the Redis Cluster case we also got benchmark results similar to the standalone Redis results above.
I might add that I enjoyed building commands with rueidis. All Redis commands may be constructed using a builder approach – rueidis comes with builders generated for all Redis commands. As an illustration, this is the process of building a PUBLISH Redis command:
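Something along these lines (the channel name and payload here are illustrative):

cmd := client.B().Publish().Channel("channel").Message("data").Build()
resp := client.Do(context.Background(), cmd)
if resp.Error() != nil {
    // handle error
}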
This drastically reduces the chance of making a silly mistake while constructing a command. Instead of always opening the Redis docs to check a command's syntax, it's now possible to just start typing and quickly arrive at the complete command to send.
Switching to Rueidis: reducing CPU usage
After making all these benchmarks and implementing the Engine on top of Rueidis, I decided to check whether Centrifugo consumes less CPU with it. I expected a notable CPU reduction since the Rueidis-based Engine implementation allocates much less than the Redigo-based one. It turned out it's not that simple.
I ran Centrifugo with some artificial load and noticed that the CPU consumption of the new implementation is actually... worse than we had with the Redigo-based Engine under equal conditions! 😩 But why?
As I mentioned above, Redis pipelining is a technique where several commands are combined into one batch sent over the network. In the case of automatic pipelining, the size of the generated batches starts playing a crucial role in application and Redis CPU usage – smaller command batches result in more read/write system calls to the kernel on both the application and Redis server sides. That's why projects like Twemproxy which sit between the app and Redis have such a good effect on Redis CPU usage, among other things.
As we have seen above, Rueidis provides better throughput and latency, but it's more aggressive in terms of flushing data to the network. So in its default configuration we get smaller batches under equal conditions than we had before with our own pipelining implementation based on Redigo (shown at the beginning of this post).
Luckily, there is an option in Rueidis called MaxFlushDelay which allows slowing down the write loop a bit to give Rueidis a chance to collect more commands to send in one batch. When this option is used, Rueidis makes a pause after each network flush no longer than the selected MaxFlushDelay value (please note that this is a delay after flushing collected pipeline commands, not an additional delay for each request). With some reasonable value it's possible to drastically reduce both application and Redis CPU utilization.
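Setting it is a one-liner in the client options, something like this (the 100 microsecond value is the one we eventually chose, see below):

client, err := rueidis.NewClient(rueidis.ClientOption{
    InitAddress: []string{":6379"},
    // Pause up to 100µs after each flush to let more commands join the next batch.
    MaxFlushDelay: 100 * time.Microsecond,
})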
To demonstrate this I created a repo: https://github.com/FZambia/pipelines.
The repo contains three benchmarks where we use automatic pipelining: based on redigo, on go-redis/redis, and on rueidis. In these benchmarks we produce concurrent requests, but instead of pushing the system towards its limits we limit the number of requests sent to Redis, so we put all libraries in equal conditions.
To rate limit requests we are using the uber-go/ratelimit library. For example, to allow a rate of no more than 100k commands per second we can do something like this:
rl := ratelimit.New(100, ratelimit.Per(time.Millisecond))

for {
    rl.Take()
    ...
}
To limit requests per second we could actually just write ratelimit.New(100000) – but we aim for a smoother distribution of requests over time, so we use millisecond resolution.
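Conceptually, the load generation loop in those benchmarks looks something like this (a simplified sketch with a rueidis client; numRequests and the key/value are placeholders, and in the repo such a loop runs concurrently from many goroutines sharing the limiter, which uber-go/ratelimit supports):

rl := ratelimit.New(100, ratelimit.Per(time.Millisecond)) // ~100k requests per second

for i := 0; i < numRequests; i++ {
    rl.Take() // blocks just long enough to keep the requested rate
    cmd := client.B().Set().Key("bench").Value("test").Build()
    if err := client.Do(context.Background(), cmd).Error(); err != nil {
        log.Fatal(err)
    }
}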
Let's run all the benchmarks in the default configuration:
Average CPU usage during the test (a bit rough but enough for demonstration):
| | Redigo | Go-redis/redis | Rueidis |
|---|---|---|---|
| Application CPU, % | 95 | 99 | 116 |
| Redis CPU, % | 36 | 35 | 42 |
OK, the Rueidis-based implementation is the worst here despite allocating less than the others. So let's try to change this by setting MaxFlushDelay to something like 100 microseconds:
Now CPU usage is:
| | Redigo | Go-redis/redis | Rueidis |
|---|---|---|---|
| Application CPU, % | 95 | 99 | 59 |
| Redis CPU, % | 36 | 35 | 12 |
So we can achieve a great CPU usage reduction: CPU went from 116% to 59% on the application side, and from 42% to only 12% for Redis! We are sacrificing latency though. Given that the CPU utilization reduction is very notable, the trade-off is pretty fair.
It's definitely possible to improve CPU usage in the Redigo and Go-redis/redis cases too, using a similar technique. But the goal here was to improve the Rueidis-based Engine implementation to make it comparable to or better than our Redigo-based implementation in terms of CPU utilization.
As you can see, we were able to achieve better CPU results just by adding a 100 microsecond delay after each network flush. In real life, where we are not running Redis on localhost and there is some network latency between the application and Redis, this delay should be insignificant. Indeed, adding MaxFlushDelay can even improve (!) the latency you have. You may wonder what happened with the benchmarks shown above after we added the MaxFlushDelay option. In Centrifugo we chose a default value of 100 microseconds, and here are the results on localhost (old without the delay, new with the delay):
> benchstat rueidis_p128.txt rueidis_delay_p128.txt
name old time/op new time/op delta
RedisPublish-8 559ns ± 1% 468ns ± 0% -16.35% (p=0.000 n=9+8)
RedisPublish_History-8 9.72µs ± 1% 9.67µs ± 1% -0.52% (p=0.007 n=9+8)
RedisSubscribe-8 1.45µs ± 1% 1.27µs ± 1% -12.49% (p=0.000 n=9+10)
RedisRecover-8 6.25µs ± 1% 5.85µs ± 0% -6.32% (p=0.000 n=10+10)
RedisAddPresence-8 3.60µs ± 1% 3.33µs ± 1% -7.52% (p=0.000 n=10+10)
(rest is not important here...)
It's even better for this set of benchmarks. Though while it's better here, the numbers may differ under different conditions. For example, in the benchmarks we run we use concurrency 128; if we reduce concurrency we will notice reduced throughput – as the batches Rueidis collects become smaller. Smaller batches + some delay to collect them = fewer requests per second.
The problem is that the value used to pause the Rueidis write loop is very use-case specific, so it's pretty hard to provide a reasonable default for it. Depending on request rate/size, network latency, etc. you may choose a larger or smaller delay. In v4.1.0 we start with a hardcoded 100 microsecond MaxFlushDelay which seems sufficient for most use cases and showed good results in the benchmarks – though possibly we will have to make it tunable later.
To check that Centrifugo benchmarks also utilize less CPU, I added a rate limiter (50k requests per second) to the benchmarks and compared the version without MaxFlushDelay and with a 100 microsecond MaxFlushDelay:
| 50k requests per second | Without delay | With 100µs delay |
|---|---|---|
| BenchmarkPublish | Centrifugo - 75%, Redis - 24% | Centrifugo - 44%, Redis - 9% |
| BenchmarkPublish_History | Centrifugo - 80%, Redis - 67% | Centrifugo - 55%, Redis - 50% |
| BenchmarkSubscribe | Centrifugo - 80%, Redis - 30% | Centrifugo - 45%, Redis - 14% |
| BenchmarkRecover | Centrifugo - 84%, Redis - 51% | Centrifugo - 51%, Redis - 36% |
| BenchmarkPresence | Centrifugo - 114%, Redis - 69% | Centrifugo - 90%, Redis - 60% |
In this test I replaced BenchmarkAddPresence with BenchmarkPresence (get information about all online subscribers in a channel) to also make sure we have a CPU reduction when using a read-intensive method, i.e. when the Redis response is reasonably large.
We observe a notable CPU usage improvement here.
Hopefully it's clear now why increasing the numPipelineWorkers value in the pipelining code shown before results in increased CPU usage on both the app and Redis sides – due to smaller batch sizes and more read/write system calls as a consequence.
BTW, wouldn't it be nice if the Go benchmarking suite could show CPU usage of the process in addition to time and alloc stats? 🤔
Adding latency
The last thing to check is how the new implementation works with increased RTT between the application and Redis. To add artificial latency on localhost on Linux one can use the tc tool, as shown here by Daniel Stenberg. But I am on macOS, so the simplest way I found was using Shopify/toxiproxy. Something like running a server:

toxiproxy-server

And then in another terminal I used toxiproxy-cli to create a toxic Redis proxy with additional latency on port 26379:

toxiproxy-cli create -l localhost:26379 -u localhost:6379 toxic_redis
toxiproxy-cli toxic add -t latency -a latency=5 toxic_redis
The benchmark results are (old is Redigo-based, new is Rueidis-based):
> benchstat redigo_latency_p128.txt rueidis_delay_latency_p128.txt
name old time/op new time/op delta
RedisPublish-8 31.5µs ± 1% 5.6µs ± 3% -82.26% (p=0.000 n=9+10)
RedisPublish_History-8 62.8µs ± 3% 10.6µs ± 4% -83.05% (p=0.000 n=10+10)
RedisSubscribe-8 1.52µs ± 5% 6.05µs ± 8% +298.70% (p=0.000 n=8+10)
RedisRecover-8 48.3µs ± 3% 7.3µs ± 4% -84.80% (p=0.000 n=10+10)
RedisAddPresence-8 52.3µs ± 4% 5.8µs ± 2% -88.94% (p=0.000 n=10+10)
(rest is not important here...)
We see that the new Engine implementation behaves much better in most cases. But what happened to the Subscribe operation? It did not change at all in the Redigo case – the same performance as if there were no additional latency involved!
It turned out that when we call Subscribe in the Redigo case, Redigo only flushes the subscribe command to the network without synchronously waiting for the result.
This makes sense in general, and we can listen to subscribe notifications asynchronously, but in Centrifugo we relied on the returned error, thinking that it reflects a successful subscription result from Redis – meaning that we are already subscribed to a channel at that point. And this could theoretically lead to some rare bugs in Centrifugo.
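In redigo terms the difference looks roughly like this (a simplified sketch, not the actual Centrifugo code):

func subscribeSketch(conn redigo.Conn) {
    psc := redigo.PubSubConn{Conn: conn}

    // Subscribe only writes the SUBSCRIBE command to the socket and flushes it –
    // a nil error here does not mean Redis has confirmed the subscription yet.
    if err := psc.Subscribe("channel"); err != nil {
        return // only connection/write errors end up here
    }

    // The actual confirmation arrives asynchronously in the receive loop.
    for {
        switch v := psc.Receive().(type) {
        case redigo.Subscription:
            log.Printf("subscribed to %s (count %d)", v.Channel, v.Count)
        case redigo.Message:
            log.Printf("message on %s: %s", v.Channel, v.Data)
        case error:
            return
        }
    }
}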
The Rueidis library waits for the subscribe response. So here the behavior of rueidis, while it differs from redigo in terms of throughput under increased latency, simply fits Centrifugo better. So we go with it.
Conclusion
Migrating from the Redigo to the Rueidis library was not just a code rewriting task: we had to carefully test various aspects of Redis Engine behaviour – latency, throughput, CPU utilization of the application, and even CPU utilization of Redis itself under equal application load conditions.
I think we will see more projects in the Go ecosystem using the rueidis library shortly. Not just because of its allocation efficiency and out-of-the-box throughput, but also due to its convenient type-safe command API.
For most Centrifugo users this migration means more efficient CPU usage, as the new implementation allocates less memory (less work to allocate and less strain on the GC) and we tried to find a reasonable batch size to reduce the number of system calls for common operations. The latency and throughput of a single Centrifugo node should also be better, as we make concurrent Redis calls from many goroutines.
Hopefully readers will learn some tips from this post which can help to achieve effective communication with Redis from Go or another programming language.
A few key takeaways:
- Redis pipelining may increase throughput and reduce latency, it can also reduce CPU utilization of Redis
- Don't blindly trust Go benchmark numbers; also think about the CPU effect of the changes you make (sometimes on the external system too)
- Reduce the number of system calls to decrease CPU utilization
- Everything is a trade-off – latency or resource usage? Your own WebSocket server or Centrifugo?
- Don't rely on someone else's benchmarks, including those published here. Measure for your own use case. Take into account your load profile, parallelism, network latency, data size, etc.
P.S. One thing worth mentioning, which may be helpful for someone: during our comparison experiments we discovered that Redis 7 has a major latency increase compared to Redis 6 when executing Lua scripts. So if you have performance-sensitive code with Lua scripts, take a look at this Redis issue. With the help of Redis developers some things have already improved in the unstable Redis branch; hopefully that issue will be closed by the time you read this post.