Reliable, Scalable and Maintainable Applications 1

Vignesh Muthukumaran - Jan 16 '23 - Dev Community

As I started to go through the book Designing Data-Intensive Applications, I thought I would consolidate my notes to provide a quick reference for my future self. This article covers the first chapter, Reliable, Scalable, and Maintainable Applications.

An awesome quote at the start of the chapter is:

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? - Alan Kay, in an interview with Dr. Dobb’s Journal (2012)

The quote emphasizes the beauty of building Reliable, Scalable and Maintainable applications and what we should strive for as architects and developers.

Broadly, applications can be classified into two types:

  • Compute-intensive - Requires a lot of computing resources; CPU power is the limiting factor
  • Data-intensive - The main challenges are the amount of data, the complexity of the data, and the pace at which the data changes

Most of today's applications are data-intensive. A typical data-intensive application is built from the following components:

  • Store data (DBs)
  • Remember the results of expensive operations (Caches)
  • Allow users to search (Search Indexes)
  • Send messages to another process, to be handled asynchronously (Stream Processing)
  • Periodically process a large amount of accumulated data (Batch Processing)
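To make the caching item in the list above concrete, here is a minimal sketch using Python's standard `functools.lru_cache`. The `expensive_report` function and the `call_count` counter are hypothetical stand-ins for a slow query or computation, not something taken from the book.

```python
from functools import lru_cache

call_count = 0  # counts how often the "expensive" work actually runs


@lru_cache(maxsize=128)
def expensive_report(customer_id: int) -> str:
    """Hypothetical stand-in for a slow query or computation."""
    global call_count
    call_count += 1
    return f"report-for-{customer_id}"


first = expensive_report(42)   # computed by running the function body
second = expensive_report(42)  # served from the cache, body is not run again
```

After both calls, `call_count` is still 1, because the second call never reaches the function body.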

The three major concerns when designing a software system are:

Reliability - The system should keep working correctly even in the face of adversity (hardware or software faults, human error, etc.)
Scalability - As the system grows (in data volume, traffic, etc.), there should be reasonable ways to deal with that growth
Maintainability - The system should be easy to maintain (making changes to fix bugs, handle more use cases, etc.)

Reliability

For software, typical reliability expectations are:

  • It performs the function the user expects
  • It tolerates user mistakes
  • Its performance is good enough for the use case, under the expected load and data volume
  • It prevents unauthorized access and abuse

In short, the system is expected to keep working correctly even when things go wrong. The things that go wrong are called faults, and systems that cope with them are called fault-tolerant (or resilient).

Fault vs Failure - A fault is one or more components deviating from their expected specification, whereas a failure is when the system as a whole can no longer serve the user's request. The goal is to keep faults from turning into failures.
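The difference between a fault and a failure can be sketched in a few lines. This is only an illustration under assumed names: `make_replica`, `read_with_failover`, and the `ReplicaDown` exception are hypothetical, and real failover logic would also need timeouts and health checks.

```python
class ReplicaDown(Exception):
    """Raised by a replica that has a fault."""


def make_replica(name, healthy):
    """Build a hypothetical read function for one replica."""
    def read(key):
        if not healthy:
            raise ReplicaDown(name)
        return f"{key}@{name}"
    return read


def read_with_failover(replicas, key):
    """Try each replica in turn, so that a single fault is tolerated."""
    faults = []
    for read in replicas:
        try:
            return read(key), faults
        except ReplicaDown as exc:
            faults.append(str(exc))
    # Every replica faulted: only now has the fault become a failure.
    raise RuntimeError(f"request failed, all replicas faulted: {faults}")


replicas = [
    make_replica("replica-a", healthy=False),
    make_replica("replica-b", healthy=True),
]
value, faults = read_with_failover(replicas, "user:1")
```

Here `replica-a` has a fault, but the user still gets an answer from `replica-b`, so there is no failure. If every replica were down, the `RuntimeError` would be the failure the user sees.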

Counterintuitively, it can make sense to increase the rate of faults by triggering them deliberately, to find out how well the system handles them (e.g. Netflix's Chaos Monkey). In some cases, though, preventing faults matters more than tolerating them: once a data breach has happened, for example, there is no way to undo it.
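A toy version of deliberate fault injection might look like the sketch below. It is not how Chaos Monkey is implemented; `WorkerPool` and `inject_fault` are hypothetical names, and the "workers" are just strings in a set.

```python
import random


class WorkerPool:
    """Hypothetical pool of interchangeable workers."""

    def __init__(self, names):
        self.alive = set(names)

    def kill(self, name):
        self.alive.discard(name)

    def handle(self, request):
        if not self.alive:
            raise RuntimeError("no workers left: the fault became a failure")
        worker = sorted(self.alive)[0]  # pick any live worker
        return f"{request} handled by {worker}"


def inject_fault(pool, rng):
    """Kill one randomly chosen live worker and report which one."""
    victim = rng.choice(sorted(pool.alive))
    pool.kill(victim)
    return victim


pool = WorkerPool(["w1", "w2", "w3"])
victim = inject_fault(pool, random.Random(0))  # seeded for repeatability
response = pool.handle("GET /health")          # still served by a survivor
```

The point of the exercise is the last line: after a worker is killed on purpose, requests should still be served. If they are not, we have found a weakness before a real fault did.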

The types of faults are:

  • Hardware faults
  • Software faults
  • Human errors

Hardware Faults

Hardware components are prone to failure, so the natural first response is redundancy: for example, disks set up in a RAID configuration, dual power supplies, and hot-swappable CPUs. Though not foolproof, these approaches can often keep a machine running uninterrupted for years.

Nowadays, many newer applications run on cloud VMs such as AWS EC2 instances. These platforms favor flexibility and elasticity over single-machine reliability, so applications also need to be able to tolerate the loss of an entire machine.

This pushes us toward multi-machine setups that tolerate machine loss in software. Compared with traditional single-server systems, they also bring a few operational advantages, such as rolling upgrades without downtime and quick scaling to match load.
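As a rough illustration of a rolling upgrade, the sketch below takes one node out of service at a time, so the rest keep serving traffic. The node dictionaries and the `rolling_upgrade` function are hypothetical; a real deployment tool would also drain connections and run health checks before moving on.

```python
def rolling_upgrade(nodes, new_version):
    """Upgrade one node at a time.

    Returns the lowest number of nodes that were in service at any point.
    """
    lowest_in_service = len(nodes)
    for node in nodes:
        node["in_service"] = False   # take this node out of rotation
        in_service = sum(1 for n in nodes if n["in_service"])
        lowest_in_service = min(lowest_in_service, in_service)
        node["version"] = new_version  # patch and reboot (simulated)
        node["in_service"] = True    # put it back into rotation
    return lowest_in_service


nodes = [
    {"name": f"node-{i}", "version": "1.0", "in_service": True}
    for i in range(3)
]
lowest = rolling_upgrade(nodes, "1.1")
```

With three nodes, at least two stay in service throughout, which is why the system as a whole never goes down during the upgrade.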

Software errors

These are errors caused by bugs in the software. There is no quick solution for them. Thorough testing, process isolation, allowing processes to crash and restart, and monitoring all help. If a system is expected to provide some guarantee, it can continuously check itself while running and raise an alert when it finds a discrepancy.
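The self-checking idea can be sketched with a tiny message pipeline. The guarantee assumed here, that every received message is either delivered or still pending, is my own example, and the `Pipeline` class is hypothetical.

```python
class Pipeline:
    """Hypothetical message pipeline that can audit its own bookkeeping."""

    def __init__(self):
        self.received = 0
        self.delivered = 0
        self.pending = []

    def accept(self, msg):
        self.received += 1
        self.pending.append(msg)

    def deliver_one(self):
        msg = self.pending.pop(0)
        self.delivered += 1
        return msg

    def self_check(self):
        """Check the guarantee: received == delivered + still pending."""
        accounted = self.delivered + len(self.pending)
        if self.received != accounted:
            return [f"ALERT: received={self.received} but accounted for={accounted}"]
        return []


pipeline = Pipeline()
for msg in ["a", "b", "c"]:
    pipeline.accept(msg)
pipeline.deliver_one()
alerts_before = pipeline.self_check()  # bookkeeping is consistent

pipeline.pending.pop()                 # simulate a bug that silently drops a message
alerts_after = pipeline.self_check()   # the discrepancy is detected
```

In a real system the check would run periodically and feed an alerting system instead of returning a list.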

Human errors

Humans are often the least reliable part of a software system. Some ways to keep these errors in check are:

  • Design the system to limit the opportunities for human error
  • Decouple the places where people make the most mistakes from the places where mistakes cause failures (e.g. a test environment separate from production)
  • Test thoroughly
  • Allow quick recovery (e.g. roll back a config change if it causes failures)
  • Set up detailed monitoring
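The quick-recovery point from the list above can be sketched as a config store that rolls back automatically when a change fails a health check. `ConfigStore`, `is_healthy`, and the `max_connections` setting are all hypothetical names chosen for illustration.

```python
class ConfigStore:
    """Hypothetical config store that keeps history so changes can be undone."""

    def __init__(self, initial):
        self.history = [dict(initial)]

    @property
    def current(self):
        return self.history[-1]

    def apply(self, changes, health_check):
        """Apply changes, then roll back if the health check fails."""
        candidate = {**self.current, **changes}
        self.history.append(candidate)
        if not health_check(candidate):
            self.rollback()
            return False
        return True

    def rollback(self):
        """Return to the previous config, if there is one."""
        if len(self.history) > 1:
            self.history.pop()


def is_healthy(config):
    # Assumed rule for this sketch: the service needs at least one connection.
    return config["max_connections"] > 0


store = ConfigStore({"max_connections": 100})
bad_applied = store.apply({"max_connections": 0}, is_healthy)     # rolled back
good_applied = store.apply({"max_connections": 200}, is_healthy)  # kept
```

The mistaken change to `max_connections` never sticks, so the operator's error costs a failed deploy instead of an outage.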

Reliability matters even in noncritical applications. There are cases where we trade some reliability for lower cost, but we should be conscious of when we are making that choice.

I will continue in the next article, as I don't want this one to get too long.
