Tracking down high memory usage in Node.js

Georgios Kampitakis - Dec 14

In this article, I will share my approach to tracking down and fixing high memory usage in Node.js.

Context

Recently I got a ticket with the title "fix memory leak issue in library x". The description included a Datadog dashboard showing a dozen services suffering from high memory usage and eventually crashing with OOM (out of memory) errors, and they all had the x library in common.

I had been introduced to the codebase quite recently (<2 weeks earlier), which made the task challenging and also worth sharing.

I started working with two pieces of information:

  • A library used by all the affected services was causing high memory usage, and it involved redis (redis was included in the name of the library).
  • The list of the services that were affected.

Below is the dashboard that was linked to the ticket:

Initial state dashboard

The services were running on Kubernetes, and it was obvious that they were accumulating memory over time until they reached the memory limit, crashed (reclaiming the memory), and started again.

Approach

In this section, I will share how I approached the task at hand: identifying the culprit of the high memory usage and later fixing it.

Understand the code

As I was fairly new to the codebase, I first wanted to understand the code: what the library in question did and how it was supposed to be used, hoping that this would make it easier to identify the problem. Unfortunately, there was no proper documentation, but by reading the code and looking at how services utilised the library I was able to understand the gist of it. It was a library wrapping redis streams and exposing convenient interfaces for event production and consumption. After spending a day and a half reading the code, I was still not able to grasp all the details and how the data flowed, due to the code structure and complexity (a lot of class inheritance and rxjs, which I am not familiar with).

So I decided to put the reading on pause and instead try to spot the problem by observing the code in action and collecting telemetry data.

Replicate the issue in isolation

As there was no profiling data available (e.g. continuous profiling) that would help me investigate further, I decided to replicate the issue locally and try to capture memory profiles.

I found a couple of ways of capturing memory profiles in Node.js, for example taking heap snapshots through the Chrome DevTools inspector or writing them programmatically with the built-in v8 module.
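As an illustration of the programmatic route, the sketch below (not from the original investigation; the SIGUSR2 handler is just an illustrative choice) writes a heap snapshot on demand that can then be opened in Chrome DevTools.

```js
// Minimal sketch: dump a heap snapshot on demand using Node's built-in v8 module.
// The resulting .heapsnapshot file can be opened in Chrome DevTools (Memory tab).
const v8 = require('v8');

process.on('SIGUSR2', () => {
  // Note: writing a snapshot is synchronous and blocks the event loop.
  const file = v8.writeHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});
```

Node also ships a --heapsnapshot-signal=<signal> flag that achieves the same without any code changes.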

With no leads on where to look, I decided to run what I thought was the most "data intensive" part of the library: the redis streams producer and consumer. I built two simple services that would produce and consume data from a redis stream, and I proceeded to capture memory profiles and compare the results over time. Unfortunately, after a couple of hours of generating load on the services and comparing the profiles, I was not able to spot any difference in memory consumption in either of the two services; everything looked normal. The library exposed a bunch of different interfaces and ways of interacting with the redis streams, and it became clear that replicating the issue would be more complicated than I expected, especially with my limited domain-specific knowledge of the actual services.
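To give an idea of what this kind of harness looks like, here is a rough sketch of a toy producer and consumer for a redis stream. It is my own reconstruction using ioredis and a made-up stream name, not the library's code.

```js
// Toy load generator: one connection produces to a Redis stream, another consumes it.
// Assumes Redis is running locally on the default port and ioredis is installed.
const Redis = require('ioredis');

const producer = new Redis();
const consumer = new Redis();
const STREAM = 'test-stream'; // hypothetical stream name

// Produce an event every 10ms.
setInterval(() => {
  producer.xadd(STREAM, '*', 'payload', JSON.stringify({ ts: Date.now() }));
}, 10);

// Consume events in a loop, blocking until new entries arrive.
async function consume(lastId = '$') {
  for (;;) {
    const res = await consumer.xread('BLOCK', 0, 'STREAMS', STREAM, lastId);
    if (!res) continue;
    const [, entries] = res[0];
    lastId = entries[entries.length - 1][0]; // continue from the newest id seen
  }
}

consume().catch(console.error);
```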

So the question was: how could I find the right moment and conditions to capture the memory leak?

Capture profiles from staging services

As mentioned earlier, the easiest and most convenient way to capture memory profiles would have been continuous profiling on the actual services being affected, an option I did not have. So I started to investigate how to at least leverage our staging services (they were facing the same high memory consumption) in a way that would allow me to capture the data I needed with no additional effort.

I started searching for a way to connect Chrome DevTools to one of the running pods and capture heap snapshots over time. I knew the memory leak was happening in staging, so if I could capture that data I hoped I would be able to spot at least some of the hotspots. To my surprise, there is a way to do just that.

The process for doing this:

  • Enable the Node.js debugger on the pod by sending a SIGUSR1 signal to the node process on your pod:

```
kubectl exec -it <nodejs-pod-name> -- kill -SIGUSR1 <node-process-id>
```

More about Node.js signals at Signal Events.

If successful you should see a log from your service:

```
Debugger listening on ws://127.0.0.1:9229/....
For help, see: https://nodejs.org/en/docs/inspector
```
  • Expose the port that the debugger is listening on locally by running:

```
kubectl port-forward <nodejs-pod-name> 9229
```
  • Connect Chrome DevTools to the debugger you enabled in the previous steps. Visit chrome://inspect/ and you should see your Node.js process in the list of targets:

Chrome DevTools

If not, make sure your target discovery settings are properly set up:

Discovery settings

Now you can start capturing snapshots over time (the period depends on how long the memory leak takes to show up) and compare them. Chrome DevTools provides a very convenient way to do this.

You can find more information about memory snapshots and Chrome DevTools at Record heap snapshot.

When creating a snapshot, all other work in your main thread is stopped. Depending on the heap contents it could even take more than a minute. The snapshot is built in memory, so it can double the heap size, resulting in filling up entire memory and then crashing the app.

If you're going to take a heap snapshot in production, make sure the process you're taking it from can crash without impacting your application's availability.

From Node.js docs

So, back to my case: selecting two snapshots to compare and sorting by delta, I got what you can see below.

Delta Comparison

We can see that the biggest positive delta was on the string constructor, which meant the service had created a lot of strings between the two snapshots and they were still in use. Now the question was where those strings were created and what was referencing them. The good thing is that the captured snapshots also contained this information, called Retainers.

Retainers

While digging into the snapshots and the never-shrinking list of strings, I noticed a pattern of strings that resembled an id. Clicking on them, I could see the chain of objects referencing them, aka the Retainers. It was an array called sentEvents in a class whose name I recognised from the library code. Tadaaa, we have our culprit: an ever-growing list of ids that, by this point, I assumed were never released. I captured a bunch of snapshots over time, and this was the single place that kept reappearing as a hotspot with a big positive delta.

Verify the fix

With this information, instead of trying to understand the code in its entirety, I needed to focus on the purpose of the array: when it got populated and when it got cleared. There was a single place where the code pushed items into the array and another where it popped them out, which narrowed the scope of the fix.

It's safe to assume that the array was not emptied when it should have been. Skipping the details of the code, what was basically happening is this:

  • The library exposed interfaces for consuming events, producing events, or both producing and consuming events.
  • When it was both producing and consuming events, it needed to track the events the process itself had produced in order to skip them and not re-consume them. sentEvents was populated when producing and cleared on the consume path, where those messages were skipped.

Can you see where this is going? When services used the library only for producing events, sentEvents would still get populated with all the events, but there was no code path (a consumer) for clearing it.
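To make the pattern concrete, here is a heavily simplified sketch of it. This is my own reconstruction; the class, flag, and method names (everything except sentEvents) are made up and not the library's actual API.

```js
// Simplified reconstruction of the leak pattern, not the actual library code.
class StreamClient {
  constructor({ consumerEnabled }) {
    this.consumerEnabled = consumerEnabled; // hypothetical mode flag
    this.sentEvents = [];                   // ids of events this process has produced
  }

  // Called for every outgoing event.
  produce(event) {
    const id = `${Date.now()}-${Math.random()}`; // stand-in for the Redis stream id
    this.sentEvents.push(id);                    // grows on every produce...
    return id;
  }

  // ...but is only ever drained here, on the consume path.
  handleIncoming(event) {
    const idx = this.sentEvents.indexOf(event.id);
    if (idx !== -1) {
      this.sentEvents.splice(idx, 1);
      return; // skip events we produced ourselves
    }
    // hand the event over to the application...
  }
}
```

A producer-only service calls produce() for every event but never handleIncoming(), so sentEvents grows without bound until the pod hits its memory limit.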

I patched the code to only track events in producer-and-consumer mode and deployed it to staging. Even with the staging load, it was clear that the patch reduced the high memory usage and didn't introduce any regressions.
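Roughly speaking, the patch amounted to a guard in the produce path of the sketch above (again an illustration, not the actual diff):

```js
// Patched produce(): only track produced ids when a consumer exists to drain them later.
produce(event) {
  const id = `${Date.now()}-${Math.random()}`;
  if (this.consumerEnabled) {
    this.sentEvents.push(id);
  }
  return id;
}
```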

Results

When the patch was deployed to production, memory usage was drastically reduced and the reliability of the service improved (no more OOMs).

Memory usage per pod and OOMs

A nice side effect was the 50% reduction in the number of pods needed to handle the same traffic.

Pod reduction

Conclusion

This was a great learning opportunity for me regarding tracking memory issues in Node.js and further familiarising myself with the tools available.

I thought it best not to dwell on the details of each tool, as that would deserve a separate post, but I hope this is a good starting point for anyone who is interested in learning more about this topic or who is facing similar issues.
