In this article, I will share my approach to tracking down and fixing high memory usage in Node.js.
Contents
Context
Recently I got a ticket with the title "fix memory leak issue in library x". The description included a Datadog dashboard showing a dozen services suffering from high memory usage and eventually crashing with OOM (out of memory) errors, and they all had the x library in common.
I had been introduced to the codebase quite recently (<2 weeks earlier), which was what made the task challenging and also worth sharing.
I started working with two pieces of information:
- A library used by all the affected services was causing high memory usage, and it involved redis (redis was part of the library's name).
- The list of the services that were affected.
Below is the dashboard that was linked to the ticket:
The services were running on Kubernetes, and it was obvious that they were accumulating memory over time until they reached the memory limit, crashed (reclaiming the memory) and started again.
Approach
In this section, I will share how I approached the task at hand: identifying the culprit of the high memory usage and later fixing it.
Understand the code
As I was fairly new to the codebase, I first wanted to understand the code, what the library in question did and how it was supposed to be used, hoping that this process would make it easier to identify the problem. Unfortunately, there was no proper documentation, but by reading the code and looking at how the services were utilising the library I was able to understand the gist of it. It was a library wrapping redis streams and exposing convenient interfaces for event production and consumption. After spending a day and a half reading the code, I was still not able to grasp all the details and how the data flowed, due to the code structure and complexity (a lot of class inheritance and RxJS, which I am not familiar with).
So I decided to put a pause on reading and try to spot the problem by observing the code in action and collecting telemetry data.
Replicate the issue in isolation
As there was no profiling data available (e.g. continuous profiling) that would help me investigate further, I decided to replicate the issue locally and try to capture memory profiles.
I found a couple of ways of capturing memory profiles in Node.js, for example heap snapshots through the inspector (Chrome DevTools) or programmatically from inside the process with `v8.writeHeapSnapshot()`.
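As a quick illustration of the programmatic option, here is a minimal sketch, assuming a Node.js version that ships `v8.writeHeapSnapshot()` (11.13+); the signal choice and file name are my own, and keep in mind that writing a snapshot blocks the event loop for its duration:

```ts
// heap-snapshot.ts - a minimal sketch of programmatic heap snapshot capture.
// The resulting .heapsnapshot file can be loaded into Chrome DevTools.
import { writeHeapSnapshot } from "v8";

// SIGUSR2 is used here because SIGUSR1 is reserved by Node.js for the inspector.
process.on("SIGUSR2", () => {
  const file = writeHeapSnapshot(`heap-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
});
```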
With no leads on where to look, I decided to run what I thought was the most "data intensive" part of the library: the redis streams producer and consumer. I built two simple services that would produce and consume data from a redis stream, and I proceeded with capturing memory profiles and comparing the results over time. Unfortunately, after a couple of hours of generating load on the services and comparing the profiles, I was not able to spot any difference in the memory consumption of either service; everything looked normal. The library exposed a bunch of different interfaces and ways of interacting with the redis streams. It became clear to me that replicating the issue would be more complicated than I had expected, especially with my limited domain-specific knowledge of the actual services.
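To give an idea of the kind of harness I mean, below is a rough sketch using plain ioredis, not the library itself or the actual services; the stream name and payload are made up:

```ts
// load-harness.ts - a hypothetical producer/consumer pair for a redis stream.
import Redis from "ioredis";

const producer = new Redis();          // assumes redis running on localhost:6379
const consumer = new Redis();
const STREAM = "test-stream";          // made-up stream name

async function produce() {
  for (let i = 0; ; i++) {
    // XADD an auto-id entry with a small JSON payload.
    await producer.xadd(STREAM, "*", "payload", JSON.stringify({ i, ts: Date.now() }));
    if (i % 10_000 === 0) console.log(`produced ${i} events`);
  }
}

async function consume() {
  let lastId = "$";                    // start from "now", only read new entries
  for (;;) {
    // XREAD BLOCK 5000 STREAMS test-stream <lastId>
    const res = await consumer.xread("BLOCK", 5000, "STREAMS", STREAM, lastId);
    if (!res) continue;                // blocking read timed out, nothing new
    for (const [, entries] of res) {
      for (const [id] of entries) lastId = id;
    }
  }
}

produce().catch(console.error);
consume().catch(console.error);
```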
So the question was, how can I find the right moment and conditions to capture the memory leak?
Capture profiles from staging services
As mentioned earlier, the easiest and most convenient way to capture memory profiles would have been continuous profiling on the actual services being affected, an option I did not have. So I started to investigate how to at least leverage our staging services (they were facing the same high memory consumption), which would allow me to capture the data I needed with no additional effort.
I started searching for a way to connect Chrome DevTools to one of the running pods and capture heap snapshots over time. I knew the memory leak was happening in staging, so if I could capture that data I was hoping I would be able to spot at least some of the hotspots. To my surprise, there is a way to do just that.
The process for doing this:
- Enable the Node.js debugger on the pod by sending a `SIGUSR1` signal to the node process on your pod:

```sh
kubectl exec -it <nodejs-pod-name> -- kill -SIGUSR1 <node-process-id>
```
More about Node.js signals at Signal Events.
If successful, you should see a log from your service:

```
Debugger listening on ws://127.0.0.1:9229/....
For help, see: https://nodejs.org/en/docs/inspector
```
- Expose the port that the debugger is listening on locally by running:

```sh
kubectl port-forward <nodejs-pod-name> 9229
```
- Connect Chrome DevTools to the debugger you enabled in the previous steps. Visit `chrome://inspect/` and you should see your Node.js process in the list of targets; if not, make sure your target discovery settings are properly set up.
Now you can start capturing snapshots over time (the period depends on the time required for the memory leak to manifest) and compare them. Chrome DevTools provides a very convenient way to do this.
You can find more information about memory snapshots and Chrome DevTools at Record heap snapshot.
When creating a snapshot, all other work in your main thread is stopped. Depending on the heap contents, it could even take more than a minute. The snapshot is built in memory, so it can double the heap size, filling up the entire memory and then crashing the app.
If you're going to take a heap snapshot in production, make sure the process you're taking it from can crash without impacting your application's availability.
So back to my case: selecting two snapshots for comparison and sorting by delta, I got what you can see below.
We can see that the biggest positive delta was happening on the `string` constructor, which meant the service had created a lot of strings between the two snapshots and that they were still in use. Now the question was where they were created and what was referencing them. Good thing the captured snapshots contained this information as well, called Retainers.
While digging into the snapshots and the never-shrinking list of strings, I noticed a pattern of strings resembling an id. Clicking on them, I could see the chain of objects that were referencing them - aka Retainers. It was an array called `sentEvents` from a class name that I recognised from the library code. Tadaaa, we have our culprit: an ever-growing list of ids that, by this point, I assumed were never released. I captured a bunch of snapshots over time and this was the single place that kept reappearing as a hotspot with a big positive delta.
Verify the fix
With this information, instead of trying to understand the code in its entirety, I needed to focus on the purpose of the array: when it got populated and when it got cleared. There was a single place where the code was pushing items to the array and another one where the code was popping them out, which narrowed the scope of the fix.
It was safe to assume that the array was not being emptied when it should have been. Skipping the details of the code, what was basically happening is this:
- The library exposed interfaces for consuming events, producing events, or both producing and consuming events.
- When it was both consuming and producing events, it needed to track the events that the process itself produced in order to skip them and not re-consume them. The `sentEvents` array was populated when producing, and entries were cleared when the consumer skipped those messages.
Can you see where this is going? When services were using the library only for producing events, `sentEvents` would still get populated with all the events, but there was no code path (the consumer) for clearing it.
I patched the code to only track events in producer-consumer mode and deployed it to staging. Even with the staging load, it was clear that the patch helped reduce the high memory usage and didn't introduce any regressions.
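For illustration, here is a simplified sketch of the shape of the bug and the patch. The class and method names are invented and the real library is more involved; only `sentEvents` comes from the heap snapshots:

```ts
// A simplified, hypothetical sketch - not the library's actual code.
type Mode = "producer" | "consumer" | "producer-consumer";

class StreamClient {
  // Ids of events this process produced; this is the array that kept growing.
  private sentEvents: string[] = [];

  constructor(private readonly mode: Mode) {}

  async produce(eventId: string): Promise<void> {
    // ...XADD the event to the redis stream...

    // The fix: only track our own events when we also consume, because only
    // the consumer code path ever removes entries from sentEvents. Before the
    // patch this push was unconditional, so in producer-only mode the array
    // grew forever.
    if (this.mode === "producer-consumer") {
      this.sentEvents.push(eventId);
    }
  }

  async consume(eventId: string, handler: (id: string) => void): Promise<void> {
    // ...XREADGROUP from the redis stream...
    if (this.isOwnEvent(eventId)) return; // skip events we produced ourselves
    handler(eventId);
  }

  private isOwnEvent(eventId: string): boolean {
    const idx = this.sentEvents.indexOf(eventId);
    if (idx === -1) return false;
    this.sentEvents.splice(idx, 1);       // release the id once it has been seen
    return true;
  }
}
```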
Results
When the patch was deployed to production, memory usage was drastically reduced and the reliability of the service improved (no more OOMs).
A nice side effect was the 50% reduction in the number of pods needed to handle the same traffic.
Conclusion
This was a great learning opportunity for me regarding tracking memory issues in Node.js and further familiarising myself with the tools available.
I thought it best not to dwell on the details of each tool as that would deserve a separate post, but I hope this is a good starting point for anyone that's interested in learning more about this topic or facing similar issues.