You might have heard about the 1BRC (One Billion Row Challenge). Well… Brazil being Brazil, we just had the second instance of the Backend Cockfight (“Rinha de Backend”), where backend languages, frameworks, and other technologies are pitted against each other in a challenge where the real prize… is the gambiarras (the hacky workarounds) you make along the way.
The first instance of the cockfight
I only heard about this one after it had finished, but here’s the repo: https://github.com/zanfranceschi/rinha-de-backend-2023-q3
If you don’t know any Portuguese, turn on auto-translate in your browser; it’s good enough!
The first instance was focused on handling as much load as possible in a stress test under CPU and memory restrictions. A lot of Brazilian influencers later reviewed the challenge and the submissions; that’s how I learned about it and picked up some insights along the way.
A new challenge appears
Now that I knew it was a thing, when a new instance of the challenge was launched I managed to play around and submit some entries.
This instance was about concurrency: you had to handle lots of credit and debit operations while keeping the balance free of inconsistencies.
Lessons from the last challenge
One of the lessons from the last challenge is that even under high stress, the language doesn’t matter much. It might matter when you have millions of requests per second, but given the constraints, the bottleneck will probably be the database anyway, and it can only handle so much.
Not only that: although it’s called a backend fight, the devil is in the connecting details, so knowing (or learning) about DevOps will take you farther than knowing your backend language really well while not knowing how to properly connect all the dots.
My choices
Next.js for the API
For some reason, I was curious whether they had added Rust or something like that to the backend part; after all, “in Rust we rewrite” and all that.
Or at least curious about what kind of backend framework it uses under the hood. As far as I could see in the exported version, it’s custom-built, and even as a standalone build it’s just as good as other backend frameworks.
Surprisingly, it uses Webpack, which everyone probably feels is “slow”. But SWC (Rust mentioned again) is also in use, and it’s probably responsible for the speed.
better-sqlite3 for the DB (plus Drizzle)
Drizzle was something I’d been wanting to play with for some time, so why not now?
It was surprisingly nice to play with. I found it interesting that, in part, you’re using the DB of your choice with all its quirks, while some things carry over even when changing DBs. That said, it’s probably not trivial to switch DBs… but honestly… you probably wouldn’t anyway.
Maybe it’s plain SQLite, or maybe it’s something better-sqlite3 brings, but being able to run queries without async operations is borderline magic.
Obviously, those sync operations only work when the DB lives in the same space as the code, but some services show that SQLite can be distributed.
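As a rough sketch of what that looks like (the table and file names here are hypothetical, not the actual challenge schema):

```ts
import Database from "better-sqlite3";
import { drizzle } from "drizzle-orm/better-sqlite3";
import { eq } from "drizzle-orm";
import { integer, sqliteTable } from "drizzle-orm/sqlite-core";

// Hypothetical table: one row per client with their current balance.
const clients = sqliteTable("clients", {
  id: integer("id").primaryKey(),
  balance: integer("balance").notNull(),
});

const db = drizzle(new Database("rinha.db"));

// No await, no .then(): better-sqlite3 runs this synchronously.
const client = db.select().from(clients).where(eq(clients.id, 1)).get();
```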
This challenge, like the last one, makes it clear that the DB is the bottleneck. It also shows that even a cheap VPS could handle thousands of requests per minute, more than enough for most applications.
Even factoring in async calls to a now-distributed SQLite service, you don’t really need much to get more than enough power to handle whatever idea you have.
nginx for the load balancer
This one was a “safe choice”: I was already playing around with some new toys, so there was no need to add more to the chaos mix.
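For reference, load balancing the API replicas boils down to a few lines of config; a minimal sketch (service names and ports are illustrative, matching a docker-compose setup, not my exact submission):

```nginx
events {}

http {
  upstream api {
    # the API replicas from docker-compose (hypothetical names)
    server api1:3000;
    server api2:3000;
  }

  server {
    listen 9999;
    location / {
      # round-robin across the replicas by default
      proxy_pass http://api;
    }
  }
}
```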
What I’ve learned
I wanted to play with high stress, so I tuned the test cases to make sure I would have failing requests.
While the API had limited resources, Gatling, which the challenge uses for the load test, didn’t. It also wasn’t running in an isolated environment… I was doing a lot of things on the machine besides the API and Gatling.
That was the scenario; here’s what I found:
nginx was hogging all the memory
Most likely my fault, probably some memory leak, but I had to give nginx a lot more memory than any other running service.
Since the other services weren’t using all of theirs anyway… I gave nginx whatever I could.
One simple trick to ensure consistency
I learned this somewhere, I’m not sure where, but here’s the trick for handling concurrency.
In this challenge, we had credits and debits: while you could let all the credits pass without any restriction, debits could cause inconsistency.
But I was also saving the new current balance with each insert, which meant two operations could read the same “last operation” and end up “branching” the actual balance.
The trick is to record which “last operation” was used and add a uniqueness constraint on it.
If two operations pick up the same last operation at the same time, the first one will get the lock, save, and release; when the second one tries to save against the same last operation, it will get a constraint error, and, at the very least, you won’t have inconsistent balances.
I took advantage of having sync operations and, in that case, would just retry the whole operation. For the first version, I’m pretty sure some processes ended up in an infinite loop because of that, so I added an exit condition. Even with multiple concurrent operations going on, this let me handle them without losing any (or at least most of them, depending on the luck of the ordering and whether the user still had enough balance left).
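Here’s a minimal sketch of the idea in TypeScript with better-sqlite3 (the schema and names are hypothetical, not my actual submission):

```ts
import Database from "better-sqlite3";

const db = new Database("rinha.db");

// Each row stores the resulting balance and which transaction it was based on.
// The UNIQUE constraint on previous_id is the whole trick: two writers that
// read the same "last operation" can't both insert a successor for it.
// (This assumes each client is seeded with an initial transaction, so there is
// always a previous row; NULLs would bypass the UNIQUE check in SQLite.)
db.exec(`
  CREATE TABLE IF NOT EXISTS transactions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    client_id INTEGER NOT NULL,
    amount INTEGER NOT NULL,
    balance INTEGER NOT NULL,
    previous_id INTEGER NOT NULL UNIQUE
  );
`);

const lastOp = db.prepare(
  "SELECT id, balance FROM transactions WHERE client_id = ? ORDER BY id DESC LIMIT 1"
);
const insertOp = db.prepare(
  "INSERT INTO transactions (client_id, amount, balance, previous_id) VALUES (?, ?, ?, ?)"
);

// Retry with an exit condition, so a burst of conflicts can't loop forever.
function debit(clientId: number, amount: number, maxRetries = 10): boolean {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const last = lastOp.get(clientId) as { id: number; balance: number };
    const newBalance = last.balance - amount;
    if (newBalance < 0) return false; // would make the balance inconsistent

    try {
      // Sync call, no await, thanks to better-sqlite3.
      insertOp.run(clientId, -amount, newBalance, last.id);
      return true;
    } catch {
      // UNIQUE constraint hit: someone else used the same "last operation"
      // first. Loop around, re-read the latest balance, and try again.
    }
  }
  return false;
}
```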
Corepack: the node “package manager” manager
Something interesting I found while playing around with Dockerfiles is corepack. It acts as a “package manager” manager and has been available since Node v16.9.0 and v14.19.0.
It says it’s still experimental, but as far as I could see and use it, it does what it’s supposed to do.
Since I was using pnpm, I would usually start by running npm -g i pnpm in the container, but corepack handles that and more: you can even pin the version of the package manager to be used in the project. Oh, how I wish I’d known this before… who has never had different people using different npm versions, each one wanting to create the lock file in a different version?
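For reference, the pinning happens through the packageManager field in package.json; a minimal sketch (the version number here is just an example):

```json
{
  "name": "my-app",
  "packageManager": "pnpm@8.15.4"
}
```

With corepack enable run in the container (or in the Dockerfile), calling pnpm will fetch and use exactly that version.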
To be even better, it would only need a way to call the install and other scripts through a single interface: think something like corepack install to install the dependencies (this command exists, but it installs the package manager) and corepack run dev to run the dev script with the chosen manager. You could just call the package manager of your choice directly, but a single interface would help in most cases, especially when dropping into a new project where you just want to run something.
The database is the bottleneck and I’m still baffled
The minimum requirement, total CPU and memory constraints aside, was to have 2 replicas of the API. But no matter how much load I threw at them, I couldn’t make the services use all the CPU or memory assigned to each.
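As a reference, this is roughly how replicas and limits get declared in docker-compose (the values here are illustrative, not the actual challenge limits):

```yaml
services:
  api:
    build: .
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "0.5"
          memory: "150MB"
```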
How do you check that? With docker stats, a nice utility that shows the resource usage of each running container.
Since they weren’t using everything, I tried adding more replicas… and even then they weren’t using it all.
After reaching 5 replicas, I felt it wasn’t working properly, so I stopped there; from the tests I made, I think 4 was the sweet spot.
With that, I had memory and CPU to spare for each service, they were all using the same SQLite db file, and each was still losing a lot of requests.
Was the DB the bottleneck?
I’m thinking I/O (it’s still a file, after all)… and I’m just now realizing I never tried running a single instance, or running with no limits… well…
Replicas, limits, requests, and latency
One replica seems to have lower latency, but four can handle more RPS (requests per second). This holds with or without the CPU and memory constraints.
Gatling, or maybe the tests I’m running, seems to have a cap: with no constraints the setup can handle more RPS, but not by much. What does change is that requests stop failing and overall latency goes down.
Burn baby burn: “I paid for the whole PC, I’m gonna use the whole PC”.
What could go wrong with making numbers go up? Almost 200k requests over 4 minutes.
I tried 10 replicas with no constraints… it almost froze my computer, with everything maxed out (cores, CPU%, memory, swap…). And around 20% of the requests failed.
Then I tried just 1 replica in the same situation… but it could only handle about half of the requests, while barely using any of the available resources. That’s probably because JavaScript can only use one core.
Conclusion
I’m sure I could keep playing with the numbers, but it’s already evident that, everything else being equal, there’s an optimal number of replicas for a given amount of resources and expected RPS.
I consider myself a frontend dev (or a frontend-leaning fullstack programmer), so a proper backend dev or a DevOps person might know a good way of finding that optimal number.
And also what's going on with the DB… I still feel it’s the bottleneck.
Right now I don’t know what’s happening there, and I’d be happy to get some pointers on what’s going on. But if/when I find something… I’ll let you know.
Links
First Rinha de Backend: https://github.com/zanfranceschi/rinha-de-backend-2023-q3
Last Rinha: https://github.com/zanfranceschi/rinha-de-backend-2024-q1
My repo for the last Rinha (so you can play with it): https://github.com/Noriller/app-rinha-de-backend-2024-q1