Avoiding the Beginner Mistakes That Keep You from Scaling Your Backend ⚡️

Riken Shah - Jun 14 - Dev Community

This blog covers how I unlocked performance that allowed me to scale my backend from 50K requests to 1M requests (~16K reqs/min) on minimal resources (2 GB RAM, 1 vCPU, and a modest 50-100 Mbps of network bandwidth).

meme

It will take you on a journey with my past self. It might be a long ride, so fasten your seatbelt and enjoy! 🎢

It assumes that you are familiar with backends and writing APIs. It's also a plus if you know a bit about Go. If you don't, that's fine too; you'll still be able to follow along, as I've provided resources to help you understand each topic. (If you don't know Go, here's a quick intro.)

tl;dr,

First, we build an observability pipeline that helps us monitor every aspect of our backend. Then we start stress testing the backend, ramping up until everything eventually breaks (breakpoint testing).

  • Connection pooling to avoid hitting the max connection threshold
  • Enforcing resource constraints to avoid resource hogging by non-critical services
  • Adding indexes
  • Disabling implicit transactions
  • Increasing the max file descriptor limit on Linux
  • Throttling goroutines
  • Future plans

Intro w/ backend 🤝

Let me give a brief intro to the backend,

  • It's a monolithic, RESTful API written in Golang.
  • Written with the Gin framework and uses GORM as the ORM.
  • Uses Aurora Postgres as our sole primary database hosted on AWS RDS.
  • The backend is dockerized and we run it on a t2.small instance on AWS, which has 2 GB of RAM, 1 vCPU, and 50-100 Mbps of network bandwidth.
  • The backend provides authentication, CRUD operation, push notifications, and live updates.
  • For live updates, we open a very lightweight WebSocket connection that notifies the device that entities have been updated.

Our app is mostly read-heavy with decent write activity; if I had to give it a ratio, it'd be 65% read / 35% write.

I could write a separate blog on why we chose a monolithic architecture, Golang, and Postgres, but to give you the tl;dr: at MsquareLabs we believe in "keeping it simple, and architecting code that allows us to move at a ridiculously fast pace."

Data Data Data 🙊

Before doing any mock load generation, I first built observability into our backend: traces, metrics, profiling, and logs. This makes it easy to find issues and pinpoint exactly what is causing the pain. When you have such a strong monitoring hold on your backend, it's also easier to track down production issues quickly.

Before we move ahead, here's a quick tl;dr of logs, traces, metrics, and profiling:

  • Logs: We all know what logs are; they're just loads of textual messages we emit when an event occurs.

image.png

  • Traces: These are structured, high-visibility logs that help us encapsulate an event with the correct order and timing.

image.png

  • Metrics: All the churned numeric data, like CPU usage, active requests, and active goroutines.

image.png

  • Profiling: Gives us real-time metrics about our code and its impact on the machine, which helps us understand what's going on. (WIP, I'll talk about this in detail in the next blog.)

To learn more about how I built observability into the backend, see the next blog (WIP). I moved that section to a separate post because I wanted to avoid overwhelming the reader and keep the focus here on only one thing: OPTIMIZATION.

This is what the visualization of traces, logs, and metrics looks like:

Screenshot 2024-05-30 at 4.53.29 PM.png

So now we have a strong monitoring pipeline + a decent dashboard to start with 🚀

Mocking Power User x 100,000 🤺

Now the real fun begins: we start mocking the user who is in love with the app.

"Only when you put your love (backend) to extreme pressure, you find it's true essence ✨" - someone great lol, idk

Grafana also provides a load-testing tool, k6, so without overthinking it I decided to go with that; it requires minimal setup of just a few lines of code, and there you have a mocking service ready.

Instead of touching all the API routes I focused on the most crucial routes that were responsible for 90% of our traffic.

image.png

Quick tl;dr about k6: it's a load-testing tool built on JavaScript and Golang where you quickly define the behavior you want to mock, and it takes care of load testing it. Whatever you define in the default function is called an iteration; k6 spins up multiple virtual users (VUs), which keep processing this iteration until the given duration or iteration count is reached.

Each iteration consists of 4 requests: Create Task → Update Task → Fetch Task → Delete Task.

iLoveIMG Download (1).jpg

Starting slow, let's see how it goes for ~10K requests → 100 VUs with 30 iters each → 3,000 iters × 4 reqs → 12K requests.

image.png

That was a breeze: no sign of memory leaks, CPU overloading, or any kind of bottleneck. Hurray!

Here's the k6 summary: 13 MB of data sent, 89 MB received, averaging over 52 req/s, with a median latency of 278 ms - not bad considering all this is running on a single machine.



checks.........................: 100.00%  12001      0    
     data_received..................: 89 MB   193 kB/s
     data_sent......................: 13 MB   27 kB/s
     http_req_blocked...............: avg=6.38ms  min=0s       med=6µs      max=1.54s    p(90)=11µs   p(95)=14µs  
     http_req_connecting............: avg=2.99ms  min=0s       med=0s       max=536.44ms p(90)=0s     p(95)=0s    
    http_req_duration..............: avg=1.74s   min=201.48ms med=278.15ms max=16.76s   p(90)=9.05s  p(95)=13.76s
       { expected_response:true }...: avg=1.74s   min=201.48ms med=278.15ms max=16.76s   p(90)=9.05s  p(95)=13.76s
    http_req_failed................: 0.00%    0          24001
     http_req_receiving.............: avg=11.21ms min=10µs     med=94µs     max=2.18s    p(90)=294µs  p(95)=2.04ms
     http_req_sending...............: avg=43.3µs  min=3µs      med=32µs     max=13.16ms  p(90)=67µs   p(95)=78µs  
     http_req_tls_handshaking.......: avg=3.32ms  min=0s       med=0s       max=678.69ms p(90)=0s     p(95)=0s    
     http_req_waiting...............: avg=1.73s   min=201.36ms med=278.04ms max=15.74s   p(90)=8.99s  p(95)=13.7s 
     http_reqs......................: 24001   52.095672/s
     iteration_duration.............: avg=14.48s  min=1.77s    med=16.04s   max=21.39s   p(90)=17.31s p(95)=18.88s
     iterations.....................: 3000    6.511688/s
     vus............................: 1       min=0       max=100
     vus_max........................: 100     min=100     max=100

running (07m40.7s), 000/100 VUs, 3000 complete and 0 interrupted iterations
_10k_v_hits  [======================================] 100 VUs  07m38.9s/20m0s  3000/3000 iters, 30 per VU



Let's ramp up from 12K → 100K requests: 66 MB sent, 462 MB received, peak CPU usage of 60% and memory usage of 50%; it took 40 minutes to run (averaging ~2,500 req/min).

image.png

Everything looked fine until I saw something weird in our logs: "gorm: Too many connections". Quickly checking the RDS metrics confirmed that the number of open connections had reached 410, the limit for max open connections. The limit is set by Aurora Postgres itself based on the available memory of the instance.

Here's how you can check,

select * from pg_settings where name='max_connections'; ⇒ 410

Postgres spawns a process for each connection, which is extremely costly, considering it opens a new connection for every new request while previous queries are still executing. So Postgres enforces a limit on how many concurrent connections can be open. Once the limit is reached, it blocks any further attempts to connect to the DB to avoid crashing the instance (which could cause data loss).

Optimization 1: Connection Pooling ⚡️

Connection pooling is a technique for managing database connections: it reuses open connections and ensures the count never crosses the threshold value. If a client asks for a connection while the max connection limit has been reached, it waits until a connection gets freed or rejects the request.

There are two options here: do client-side pooling, or use a separate service like PgBouncer (which acts as a proxy). PgBouncer is indeed the better option once we are at scale with a distributed architecture connecting to the same DB. So, for the sake of simplicity and in line with our core values, we chose to move ahead with client-side pooling.

Luckily, the ORM we are using, GORM, supports connection pooling, and under the hood it uses database/sql (the Go standard library package) to handle it.

There are pretty straightforward methods to handle this,



configSQLDriver, err := db.DB()
if err != nil {
    log.Fatal(err)
}
configSQLDriver.SetMaxIdleConns(300)
configSQLDriver.SetMaxOpenConns(380) // kept a few connections as buffer for developers
configSQLDriver.SetConnMaxIdleTime(30 * time.Minute)
configSQLDriver.SetConnMaxLifetime(time.Hour)


  • SetMaxIdleConns → the maximum number of idle connections to keep in memory so we can reuse them (helps reduce the latency and cost of opening a connection)
  • SetConnMaxIdleTime → the maximum amount of time an idle connection is kept in memory.
  • SetMaxOpenConns → the maximum number of open connections to the DB; we keep it below the 410 limit because we run two environments on the same RDS instance.
  • SetConnMaxLifetime → the maximum amount of time any connection stays open.
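
If you want to keep an eye on how the pool behaves under load, database/sql also exposes live counters. Here's a minimal sketch (not part of the original setup) that logs them, assuming the same configSQLDriver handle from above:

stats := configSQLDriver.Stats() // database/sql pool statistics
log.Printf(
    "open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
    stats.OpenConnections, // connections currently open (in use + idle)
    stats.InUse,
    stats.Idle,
    stats.WaitCount,    // how many times a request had to wait for a free connection
    stats.WaitDuration, // total time spent waiting - a growing value means the pool is too small
)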

Now, going one step further to 500K requests (~4,000 req/min), 20 minutes in the server crashed 💥. Finally, let's investigate 🔎

image.png

Quickly looking through the metrics - bam! CPU and memory usage had spiked, and Alloy (the OpenTelemetry collector) was hogging all the CPU and memory rather than our API container.

image.png

Optimization 2: Unblocking Resources from Alloy (Open Telemetry Collector)

We are running three containers inside our small t2 instance,

  • API Dev
  • API Staging
  • Alloy

As we dump huge loads onto our dev server, it starts generating logs and traces at the same rate, sharply increasing CPU usage and network egress.

So it's important to ensure the Alloy container never crosses its resource limits and hampers the critical services.

As Alloy runs inside a Docker container, it was easy to enforce this constraint:



resources:
        limits:
            cpus: '0.5'
            memory: 400M



Also, this time the logs weren't empty; there were multiple "context canceled" errors - the reason being that requests timed out and the connections were abruptly closed.

image.png

And then I checked the latency - it was crazy 😲. After a certain period the average latency was 30-40 seconds. Thanks to traces, I could now pinpoint exactly what was causing such huge delays.

image.png

The query in our GET operation was extremely slow, so let's run EXPLAIN ANALYZE on it:

Screenshot 2024-06-11 at 9.55.10 PM.png

The LEFT JOIN took 14.6 seconds and the LIMIT took another 14.6 seconds. How can we optimize this? INDEXING.

Optimization 3: Adding Indexes 🏎️

Adding indexes to fields that are often used in WHERE or ORDER BY clauses can improve query performance dramatically. After adding indexes on the LEFT JOIN and ORDER BY fields, the same query took 50 ms. It's crazy - from 14.6 seconds ⇒ 50 ms 🤯

(But beware of adding indexes blindly; each extra index slowly degrades INSERT/UPDATE/DELETE performance.)

Faster queries also free up connections sooner, which helps improve the overall capacity to handle huge concurrent loads.
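
With GORM, indexes can also be declared right on the model and created via AutoMigrate. Here's a minimal sketch; the Task model and its fields are illustrative, not our actual schema:

// Hypothetical model: UserID is used in the JOIN/WHERE, CreatedAt in ORDER BY.
type Task struct {
    ID        uint      `gorm:"primaryKey"`
    UserID    uint      `gorm:"index"` // indexed: used to join/filter by user
    CreatedAt time.Time `gorm:"index"` // indexed: used for ORDER BY ... LIMIT
    Title     string
}

// db.AutoMigrate(&Task{}) creates the table along with the declared indexes.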

Optimization 4: Ensure while testing there is no blocking TRANSACTION 🤡

Technically not an optimization but a fix, and you should keep it in mind: make sure your code doesn't try to update/delete the same entity concurrently while you are stress testing.

While going over the code, I found a bug that caused an UPDATE on the user entity on every request. As each UPDATE call executed inside a transaction, which takes a lock, almost all the UPDATE calls were blocked by previous UPDATE calls.

This fix alone doubled the throughput.
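
To illustrate the failure mode (a hypothetical sketch, not our actual code): if every request runs something like the following inside its transaction, concurrent requests for the same user serialize on the row lock, which is held until the transaction commits. The User model, userID, and last_seen_at column are placeholders.

err := db.Transaction(func(tx *gorm.DB) error {
    // ... handle the actual request ...

    // This UPDATE takes a row-level lock on the user's row; every other in-flight
    // request for the same user now waits here until this transaction commits.
    return tx.Model(&User{}).
        Where("id = ?", userID).
        Update("last_seen_at", time.Now()).Error
})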

Optimization 5: Skipping implicit TRANSACTION of GORM 🎭

image.png

By default, GORM executes write queries inside a transaction, which can slow down performance. Since we already have an extremely robust transaction mechanism of our own, the chance of missing a transaction in a critical area is almost nil (unless it's an intern 🤣).

We have a middleware to create a transaction before hitting a model layer, and a centralized function to ensure the commit/rollback of that transaction in our controller layer.

By disabling this, we got a performance gain of at least ~30%.
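
In GORM this is a single config flag set when opening the connection. A minimal sketch, assuming GORM v2 with the Postgres driver (the DSN is a placeholder):

package main

import (
    "gorm.io/driver/postgres"
    "gorm.io/gorm"
)

func openDB(dsn string) (*gorm.DB, error) {
    return gorm.Open(postgres.Open(dsn), &gorm.Config{
        // Skip GORM's default per-query transaction; we manage transactions
        // ourselves via middleware + a centralized commit/rollback helper.
        SkipDefaultTransaction: true,
    })
}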

"The reason we were stuck at 4-5K requests a minute was this and I thought it was my laptop network bandwidth." - dumb me

All these optimizations led to a 5x throughput gain 💪; now my laptop alone can generate traffic of 12K-18K requests a minute.

Screenshot 2024-06-12 at 7.20.27 PM.png

Million Hits 🐉

Finally, a million hits at 10K-13K requests a minute. It took ~2 hours; it should have been done sooner, but whenever Alloy restarts (due to the resource restriction), all the metrics get lost with it.

image.png

To my surprise the maximum CPU utilization during that span was 60% and memory usage was just 150MB.

It's crazy how Golang is so performant and handles the load so beautifully. It has minimal memory footprint. Just in love with golang 💖

Each query took 200-400 ms to complete. The next step is to uncover why it takes that long; my guess is connection pooling and IO blocking slowing down the queries.

The average latency came down to ~2 seconds, but there is still a lot of room for improvements.

Implicit Optimization 🕊️

Optimization 6: Increasing the max file descriptor limit 🔋

As we run our backend on Linux, each network connection we open creates a file descriptor. By default, Linux limits this to 1024 per process, which keeps it from reaching peak performance.

Since we also open multiple WebSocket connections, under heavy concurrent traffic we'd hit this limit easily.

Docker Compose provides a nice abstraction for this:



ulimits:
  nofile: # nofile is the open file (descriptor) limit; `core` only controls core dump size
    soft: 65536
    hard: 65536

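
To double-check that the new limit is actually picked up inside the container, you can read it from the process itself. A small sketch (Linux-only, using the standard syscall package):

package main

import (
    "log"
    "syscall"
)

func main() {
    var rl syscall.Rlimit
    // RLIMIT_NOFILE is the max number of open file descriptors for this process.
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Fatal(err)
    }
    log.Printf("open file limit: soft=%d hard=%d", rl.Cur, rl.Max)
}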




Optimization 7: Avoid overloading goroutines 🤹

As Go developers, we often take goroutines for granted and mindlessly run many non-critical tasks inside them; we add go before a function call and then forget about its execution, but under extreme conditions this can become a bottleneck.

To ensure goroutines never become a bottleneck for me, for the services that often run in a goroutine I use an in-memory queue with n workers to execute the tasks (see the sketch below).
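
A minimal sketch of the idea (not our exact implementation): a bounded in-memory queue drained by a fixed number of workers, so a traffic spike can't spawn an unbounded number of goroutines.

package worker

// Task is any non-critical piece of work (e.g. sending a push notification).
type Task func()

// Pool runs tasks from a bounded queue with a fixed number of worker goroutines.
type Pool struct {
    queue chan Task
}

func NewPool(workers, queueSize int) *Pool {
    p := &Pool{queue: make(chan Task, queueSize)}
    for i := 0; i < workers; i++ {
        go func() {
            for task := range p.queue {
                task()
            }
        }()
    }
    return p
}

// Submit enqueues a task; it blocks (back-pressure) when the queue is full.
func (p *Pool) Submit(t Task) {
    p.queue <- t
}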

image.png

Next Steps 🏃‍♀️

Improvements: Moving from t2 to t3 or t3a

t2 is an older generation of AWS general-purpose machines, while t3, t3a, and t4g are newer generations. They are burstable instances, and they provide far better network bandwidth and better performance for prolonged CPU usage than t2.

Understanding burstable instances,

AWS introduced burstable instance types mainly for workloads that don't require 100% of the CPU most of the time. These instances operate at a baseline performance (20%-30%) and maintain a credit system: whenever your instance doesn't need the CPU, it accumulates credits, and when a CPU spike occurs, it spends those credits. This reduces your cost and reduces wasted compute for AWS.

t3a would be a nice family to stick with because its cost/performance ratio is among the best of the burstable instance families.

Here's a nice blog comparing t2 and t3.

Improvements: Query

There are many improvements that we can make to query/schema to improve speed, some of them are:

  • Batching inserts into insert-heavy tables (see the sketch below).
  • Avoiding the LEFT JOIN via denormalization.
  • Adding a caching layer.
  • Sharding & partitioning, but this comes much later.

Improvements: Profiling

The next step to unlock performance is to enable profiling and figure out exactly what is going on at runtime.

Improvements: Breakpoint Testing

To discover the limitations and capacity of my server, breakpoint testing is the next step forward.

End Note 👋

If you have read till the end, you are cracked, congrats 🍻

This is my first blog, please let me know if something is unclear, or if you want a more in-depth dive into the topic. In my next blog, I'll be taking a deep dive into profiling so stay tuned.

You can follow me on X, to stay updated :)
