An odd coincidence
On March 20th, ChatGPT users reported seeing conversations that were not their own. Just a few weeks earlier I had solved a bug where Dalli (the memcached client Rails uses for caching) would return incorrect values. I thought, "this ChatGPT incident sounds a lot like what that caching bug could cause if it returned incorrect cached HTML." But OpenAI doesn't use Rails, so I shrugged the thought off as recency bias.
My jaw dropped when I saw the postmortem — it was exactly the same bug concept, just in a different library! A reminder that hard things often transcend particular languages and libraries. And boy, is this a hard bug. It sits at the intersection of caching, shared resource management, and state corruption — infamously tricky problem spaces.
The bug mechanism
The Dalli caching client connects to a memcached server over a socket (which is essentially just a buffer; I'll use the terms interchangeably). The client issues a get message to the server that says "give me the value for this key". In response, the server writes some data into the socket buffer. The client then reads bytes off the socket until it has processed one complete response, on the assumption that what it's reading is the response to its own get message. If the socket was empty at the beginning of the operation, that assumption holds and the client returns the correct cache value.
But what if the buffer was not empty when the client issued the get command? Then the client processes the data that was on the socket as if it were the requested value, and — since no extra validation steps occur — returns the wrong value for the given key. Worse, because it would only read one value's worth of data out of the buffer (it knows when to stop reading based on a metadata header), the actual response to its request would remain on the socket for the next request to return incorrectly, and so on.
The socket has thus entered an indefinite "off by one" corrupted state: the nth get operation returns the value for the (n-1)th operation's key, leaving its own value in the buffer for the (n+1)th to read. Oh no! That same corrupted state, in a different library, explains ChatGPT's incident.
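To make the failure mode concrete, here is a minimal toy simulation of that off-by-one chain. It is purely illustrative: the FakeSocket class and its method names are stand-ins I invented for this sketch, not Dalli's or memcached's actual code.

```ruby
# Toy model: the "socket" is just a queue of responses the server has written.
class FakeSocket
  def initialize
    @buffer = []
  end

  # The server appends one response per request it receives.
  def server_writes(value)
    @buffer << value
  end

  # The client reads exactly one response and assumes it answers
  # the request it just sent.
  def client_reads_one
    @buffer.shift
  end
end

socket = FakeSocket.new

# A leftover response from an earlier request that was never read:
socket.server_writes("value-for-key-0")

%w[key-1 key-2 key-3].each do |key|
  socket.server_writes("value-for-#{key}")         # the real answer arrives...
  puts "get(#{key}) => #{socket.client_reads_one}" # ...but the stale one is returned
end
# Output:
#   get(key-1) => value-for-key-0
#   get(key-2) => value-for-key-1
#   get(key-3) => value-for-key-2
```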
If your cache system starts returning the wrong values, that's very scary from a security perspective, because data or HTML cached for one user might be shown to another user. In the case I saw firsthand this didn't happen, because the incorrect cache values all had mismatched types, causing the process to error repeatedly until it was killed. That was lucky, but a Rails app could conceivably see the exact kind of incident OpenAI saw. It would just depend on what type of data was on the corrupted socket and how the application uses the cache.
How it happens
Why would there be an unread value in the buffer? All that's required is for the client to send a get command to the server and then fail to actually read the response — while also leaving the socket around for a future request to use.
One way this might happen in Dalli is if there's an issue in the get_multi code. This could occur if the client requests multiple values, reads some subset of those values off of the socket, and then returns before the server finishes writing all of the values to the socket. Another way this might happen is if something (say, an error or a timeout) interrupts the client code execution between issuing a get message and reading the response.
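As a hedged illustration of that second path, here is a toy sketch of how an application-layer timeout can leave an unread response behind. The ToyConnection class is an invention for this example, not Dalli's API; the point is only where the interruption lands.

```ruby
require "timeout"

class ToyConnection
  def initialize
    @unread = []
  end

  def get(key)
    @unread << "value-for-#{key}" # request issued; the response lands in the buffer
    sleep 0.05                    # pretend the read is slow
    @unread.shift                 # read the response (never reached if interrupted)
  end

  def dirty?
    !@unread.empty?
  end
end

conn = ToyConnection.new

begin
  # The timeout fires after the request is issued but before its response is read.
  Timeout.timeout(0.01) { conn.get("key-A") }
rescue Timeout::Error
  # The dangerous code path: the connection is kept around without being reset.
end

puts conn.dirty? # => true: key-A's unread response now awaits the next caller
```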
Ideally the client would discard or reset the socket in case of an error or timeout, and in most scenarios that's exactly what would happen. But all it takes is one code path where the socket can get corrupted and stick around. Unfortunately, there is at least one such code path, which explains the incorrect caching behavior I observed.
The specifics are less important than the question of whether the client should rely completely on the assumption that the socket will be empty when it begins a get operation.
The fix
As a security engineer, I love when I can solve a bug deep in an abstraction, making it impossible for a developer using the abstraction to cause the bug again, even if they make a mistake. We should encourage developers to avoid known dangers, sure, but it's best for the abstraction itself to prevent unacceptable outcomes automatically. Likewise, within the abstraction we should try to manage our state correctly, but it's best if we can preclude the worst impacts even when there's a flaw and some unexpected state emerges. It is far more robust to make a failure case impossible at the root than to rely on all the higher-level code doing the right thing every time.
To that end, I pursued a simple foundational fix. Memcached has a getk command, which differs from get by returning the key alongside the value. The downside is a small performance cost: returning the key means processing more bytes. The upside is that the client gets both the key and the value back from memcached, and can check that the key matches what it asked for.
That's the core of my implementation: if the keys don't match, raise an error. Simple. Now even if the buffer contains a leftover value for a different key, the client won't return it. Instead, it will reset and retry.
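Here is a minimal sketch of that idea, heavily simplified: the KeyMismatchError class, the StubConnection stand-in, and the getk/reset method shapes are assumptions for illustration, not the actual safe_get implementation or Dalli's internals.

```ruby
class KeyMismatchError < StandardError; end

def safe_get(connection, key)
  response_key, value = connection.getk(key) # getk echoes the key back with the value
  return value if response_key == key

  connection.reset # discard the corrupted socket so the caller retries on a clean one
  raise KeyMismatchError, "asked for #{key.inspect}, got #{response_key.inspect}"
end

# A toy connection whose buffer is already off by one:
class StubConnection
  def initialize
    @pending = [["stale-key", "stale-value"], ["user:1", "fresh-value"]]
  end

  def getk(_key)
    @pending.shift
  end

  def reset
    @pending.clear
  end
end

begin
  safe_get(StubConnection.new, "user:1")
rescue KeyMismatchError => e
  puts e.message # raised instead of silently returning "stale-value"
end
```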
Analyzing the fix
What of the cost versus the benefit? In the context of a multi-tenant Rails application, it is generally not an acceptable risk to potentially expose one tenant's data to another, even as a low-probability event. A small increase in overhead that completely rules out a data exposure bug is a price worth paying.
In this case, the overhead is really small. When I rolled out this implementation in an application serving millions of requests a day, there was zero measurable impact on performance. Even if the overhead were noticeable, it would still be worth paying for most threat models.
It is conceivable that an application using the caching client might make a different cost/benefit calculation and decide that performance is paramount and key:value integrity failures are acceptable. That's why, in the upstream PR, I proposed making the check a configuration option. Unfortunately the maintainer did not agree that the client should solve this problem at all, despite multiple other reports of it happening in production systems.
The maintainer had some valid points — application-layer timeouts in Ruby are dangerous, and if you're using them, extreme caution is warranted. Consider killing the process entirely after a timeout to prevent state corruption bugs like this one.
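For what that advice might look like in practice, here is a hedged sketch of my own (not something from the PR discussion): treat an application-layer timeout as fatal, so that no half-read connection survives into the next request served by the process.

```ruby
require "timeout"

def with_fatal_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  warn "timeout exceeded; exiting so the supervisor restarts a clean worker"
  exit!(1) # skip at_exit hooks; any shared socket in this process is now suspect
end

with_fatal_timeout(5) do
  # slow work that talks to memcached goes here
end
```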
Nonetheless, I firmly believe that, given the severity of this issue, it is worthwhile for the client itself to prevent the worst impact. I am also not convinced that timeouts are the only way for this bug to happen: see the get_multi issue above, or consider a future code change introducing a connection-handling bug.
Alternatives
Another way to solve the problem would be to keep the socket locked until the response is read. That way, a socket with an unread response would not be re-used by a subsequent operation. I'm not sure why Dalli doesn't already work this way, but currently it releases the lock after issuing the get command and re-acquires it before reading the response. I opened a second PR proposing to keep the lock on the socket for the entire get sequence, which was also rejected.
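A rough sketch of that alternative, assuming a connection wrapper of my own invention (this is not Dalli's actual locking code): the lock is held across both the write and the read, so a connection carrying an unread response can never be handed to another caller.

```ruby
require "monitor"

class LockedConnection
  def initialize(io)
    @io = io            # any duplexed IO, e.g. a TCPSocket to memcached
    @lock = Monitor.new
  end

  def get(key)
    @lock.synchronize do
      @io.write("get #{key}\r\n") # issue the request...
      read_response               # ...and read the reply before releasing the lock
    end
  end

  private

  def read_response
    @io.gets # real code would parse memcached's full response format
  end
end
```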
The safe_get implementation still has an advantage in that it works regardless of whether the socket is properly handled, or even if memcached sends extraneous responses. That approach is publicly available and production tested. Please let me know if you have any questions or feedback about it!