Tags: python, list, redis, linked-list, protocols

Why did Redis fall over when we rewrote this protocol to use fewer commands?


In our Python application, we have a service-client communication protocol that we wrote to allow service clients to make requests to service servers running in different processes or on different machines altogether within our server infrastructure. This protocol uses Redis as a backend, and so far it works great (faster than using HTTP, for example).

For context, our Redis keys are always < 64 bytes and when I use the term "message" I am referring to msgpacked binary strings that are always < 100KB but are typically around 3KB. In the case of the specific service we were testing with, client request messages are almost exactly 500 bytes every time, while server response messages are almost exactly 2KB every time. We are using Redis 3.0.6.
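
For a concrete sense of what such a message looks like, here is a toy example (the payload fields are invented for illustration):

    import msgpack

    # A hypothetical request payload; real messages have different fields but
    # are packed the same way, typically landing around 3KB.
    message = msgpack.packb({'action': 'get_user', 'user_id': 123456789})
    assert isinstance(message, bytes) and len(message) < 100 * 1024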

The initial version of our protocol works like this (Redis command names styled like THIS; a code sketch of this flow follows the list):

  • Generate a unique request ID to use as a key
  • Call a Lua script that:
    • LLEN the request list with key "server-requests" to fail if the list has too many messages in it
    • SET the request ID key with the value of the request message
    • EXPIRE the request ID key to 60 seconds
    • RPUSH the request ID key onto the request list with key "server-requests"
    • EXPIRE the request list with key "server-requests" to 60 seconds
  • BLPOP from the request list with key "server-requests" to get the next request ID
  • GET the request ID key to get the request message
  • DEL the request ID key
  • Generate a unique response ID to use as a key
  • Call a Lua script that:
    • LLEN the response list with key "client-id-123456789-responses" to fail if the list has too many messages in it
    • SET the response ID key with the value of the response message
    • EXPIRE the response ID key to 60 seconds
    • RPUSH the response ID key onto the response list with key "client-id-123456789-responses"
    • EXPIRE the response list with key "client-id-123456789-responses" to 60 seconds
  • BLPOP from the response list with key "client-id-123456789-responses" to get the response ID
  • GET the response ID key to get the response message
  • DEL the response ID key
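
To make the moving parts concrete, here is a rough sketch of what the version-1 posting script and the server's consuming side might look like in redis-py. The queue limit, key prefix, and connection settings are illustrative placeholders, not our exact production code:

    import uuid

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # Lua script: check the queue length, store the message under its own key,
    # then push just the request ID onto the shared list.
    POST_REQUEST_V1 = r.register_script("""
        -- KEYS[1] = request ID key, KEYS[2] = request list ("server-requests")
        -- ARGV[1] = msgpacked message, ARGV[2] = max queue length
        if redis.call('LLEN', KEYS[2]) >= tonumber(ARGV[2]) then
            return redis.error_reply('request queue is full')
        end
        redis.call('SET', KEYS[1], ARGV[1])
        redis.call('EXPIRE', KEYS[1], 60)
        redis.call('RPUSH', KEYS[2], KEYS[1])
        redis.call('EXPIRE', KEYS[2], 60)
        return 1
    """)

    def post_request(message, max_len=1000):
        request_id = 'request-' + uuid.uuid4().hex   # unique request ID key
        POST_REQUEST_V1(keys=[request_id, 'server-requests'],
                        args=[message, max_len])
        return request_id

    def get_next_request(timeout=60):
        # BLPOP returns (list key, popped value), or None on timeout
        popped = r.blpop('server-requests', timeout=timeout)
        if popped is None:
            return None
        request_id = popped[1]
        message = r.get(request_id)   # fetch the actual message...
        r.delete(request_id)          # ...and clean up the ID key
        return message

The response side mirrors this exactly, using the "client-id-123456789-responses" list instead.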

This worked, and under fluctuating load of 800 to 1,600 requests per second, the average time was ~2ms for each of the following: 1) post a request to Redis, 2) get a request from Redis, 3) post a response to Redis, 4) get a response from Redis.

We didn't love this performance, so we decided to improve on the protocol with a second version (sketched in code after the list):

  • Call a Lua script that:
    • LLEN the request list with key "server-requests" to fail if the list has too many messages in it
    • RPUSH the request message onto the request list with key "server-requests"
    • EXPIRE the request list with key "server-requests" to 60 seconds
  • BLPOP from the request list with key "server-requests" to get the next request message
  • Call a Lua script that:
    • LLEN the response list with key "client-id-123456789-responses" to fail if the list has too many messages in it
    • RPUSH the response message onto the response list with key "client-id-123456789-responses"
    • EXPIRE the response list with key "client-id-123456789-responses" to 60 seconds
  • BLPOP from the response list with key "client-id-123456789-responses" to get the response message
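
For comparison, here is version 2 under the same illustrative assumptions. The ID keys and the per-message SET/GET/DEL traffic disappear entirely:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # Lua script: check the queue length, then push the message itself.
    POST_REQUEST_V2 = r.register_script("""
        -- KEYS[1] = request list ("server-requests")
        -- ARGV[1] = msgpacked message, ARGV[2] = max queue length
        if redis.call('LLEN', KEYS[1]) >= tonumber(ARGV[2]) then
            return redis.error_reply('request queue is full')
        end
        redis.call('RPUSH', KEYS[1], ARGV[1])
        redis.call('EXPIRE', KEYS[1], 60)
        return 1
    """)

    def post_request(message, max_len=1000):
        POST_REQUEST_V2(keys=['server-requests'], args=[message, max_len])

    def get_next_request(timeout=60):
        # BLPOP now yields the message directly: one round trip instead of three
        popped = r.blpop('server-requests', timeout=timeout)
        return popped[1] if popped else None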

Our thinking was that, with far fewer Redis operations, this should be much faster and allow us to scale better. Early indications were that we were right. At 800 requests/second, that ~2ms had become ~1ms. However, at 1,600 requests/second, Redis fell over and barfed all over itself. CPU usage spiked to nearly 100%, requests were being rejected, and the response time increased from ~1ms to a range of 100ms to 8 SECONDS. But one thing we never saw was an error from that Lua script saying that the lists had too many items in them (the LLEN check above).

I'm at a total loss. None of this makes sense. Intuitively, it should be better. The testing seems to suggest that storing 0.5-2KB strings in lists performs terribly above a certain load (unless I'm missing some other variable here), but the documentation does not back that up. As linked lists, pushing to and popping off of Redis lists should be constant-time operations regardless of the size of the values, and we aren't even anywhere NEAR the maximum size of 512MB per item.

Has anyone ever seen behavior like this, or have suggestions about what the problem could be?


Solution

The full discussion can be reviewed in the above-linked Redis Google group discussion, but when this came back across my desk this morning, it seemed pertinent to summarize the conclusions we came to, which I'll break up into what we know and what we can only theorize.

What we know

We were using sentinel.master_for from the redis-py library incorrectly. Every time you call it, a brand-new connection to the Redis master is created. This extremely rapid creation of connections was substantially increasing our resource usage. We changed the code to call sentinel.master_for a single time and cache that connection until a failure is detected, then call it once more to get a connection to the new master, and so on. Things improved dramatically: redis_total_connections_received plummeted (this counts new connections, and it had been in the thousands), CPU usage decreased, memory usage decreased, disk consumption decreased, network traffic decreased, total Redis keys went from ~6 million to ~150,000, and the time to send a request to Redis went from 2ms to 0.2ms. Most importantly, the problems with the second version of our protocol vanished.
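
In code, the fix amounts to something like the following sketch (the sentinel addresses and the 'mymaster' service name are placeholders):

    from redis.exceptions import ConnectionError as RedisConnectionError
    from redis.sentinel import Sentinel

    sentinel = Sentinel([('sentinel-1', 26379), ('sentinel-2', 26379)])

    _master = None   # cached connection to the current master

    def get_master():
        # Before the fix, we effectively called sentinel.master_for() on every
        # operation, creating a brand-new connection each time. Now we create
        # it once and reuse it until a failure forces a refresh.
        global _master
        if _master is None:
            _master = sentinel.master_for('mymaster')
        return _master

    def run(command, *args):
        global _master
        try:
            return getattr(get_master(), command)(*args)
        except RedisConnectionError:
            _master = None   # likely a failover: look up the new master next call
            raise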

What we can only theorize

The best explanation we can come up with for why this problem revealed itself only with the second version of our protocol is that the second version was, indeed, faster, so sentinel.master_for was being called even more rapidly as the applications handled more "requests" per second, resulting in a higher rate of new Redis connection creation. This theory also explains the cyclic nature of the problem: when Redis bogged down with new connections, it slowed our applications down, which slowed the rate at which we called sentinel.master_for, which slowed the rate of new connection creation, which let Redis return to more normal behavior, and so on.

This theory was hard to prove with the data available, so everyone on the team accepted it as the most likely explanation and didn't feel it necessary to spend more time and money definitively proving it for a problem that was already solved.