You have probably heard about observability, but… do you really understand it? In this post, I'll explain it in an easy way. If this is the first time you read something about observability, I hope you learn a lot. If you already know a lot about observability, I hope you learn something new.
What is observability?
A system is observable when you can infer its internal state just from its outputs. That's it. Sounds easy, right? Not exactly. Systems grow more complex every day, and deciding which outputs to provide, which tools to use to manage that data, and how to convert that data into information can be a big challenge.
A 101 example
Let's imagine you created a command-line calculator application. At some point, you multiply 3 by 3 and the result is 9 but, if you multiply 3 by 4, the result is 150. How would you debug that issue? You would probably:
1) Check the output of the program for different inputs
2) Read the source code of the tool to detect what can go wrong
3) Use a debugging tool (or add a lot of temporary print statements, right?)
And, after some time, you'll find the issue, fix the problem, rebuild your application and, hopefully, run the tests to verify nothing else broke. This is slow. And our example is a small application running on a single node: imagine when your codebase has thousands of lines of code, runs in a multi-cloud environment with hundreds of nodes, and interacts with other applications. Problems can be difficult to find, can't they? Complex systems fail in complex ways.
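For fun, here is a minimal Python sketch of what such a bug could look like; the corrupted cache entry and the temporary print statement are invented purely for illustration:

```python
# Entirely made-up bug: a cached result table with a corrupted entry.
_result_cache = {(3, 4): 150}  # the wrong value sneaked in somehow

def multiply(a: int, b: int) -> int:
    if (a, b) in _result_cache:
        print(f"debug: cache hit for ({a}, {b})")  # temporary print statement
        return _result_cache[(a, b)]
    return a * b

print(multiply(3, 3))  # 9, correct
print(multiply(3, 4))  # 150, and the debug line above points straight at the cache
```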
Why is observability needed?
Businesses have grown more dependent on technology over the years, and the architecture of their software systems has grown in complexity too. And, as a consequence, so has the architecture of the systems where this software runs.
From monolith to microservices
There are many articles about how companies have evolved their software architecture from a single application to many smaller ones. "Splitting" your application into microservices has some advantages, like better scalability, faster time to market, greater agility or better fault tolerance.
But it also has some disadvantages, like the need for more collaboration between teams or poorer performance (because information now travels over the network). And you have to maintain more systems. So, when something needs investigation (like an error or a performance degradation), instead of reviewing a single element, we need to inspect all our microservices to detect which one is causing the issue and where.
From one machine to containers
Another big change was in the infrastructure. Before, applications were deployed on a single machine. Dependencies such as databases were installed on the same machine. If the system needed more resources, the solution was to buy new hardware and migrate everything to the new machine. This is called vertical scaling. But this approach is limited by how many resources a single machine can hold, and hardware with a lot of resources is expensive. And you have to throw away your old hardware! What is the solution? Horizontal scaling.
We can simplify and say that horizontal scaling means using small interconnected machines that work as a single unit. For example, you can have 10 machines running a web application and another one running load balancer software to distribute the load between the different nodes.
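As a tiny illustration, here is a sketch of the round-robin strategy such a load balancer could apply (the node addresses are made up):

```python
import itertools

# Ten hypothetical web application nodes behind one load balancer.
backends = [f"10.0.0.{i}:8080" for i in range(1, 11)]
rotation = itertools.cycle(backends)

def route(request_id: int) -> str:
    """Send each incoming request to the next node in the rotation."""
    return f"request {request_id} -> {next(rotation)}"

for request_id in range(4):
    print(route(request_id))
```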
But using physical machines has a big inconvenience: when you don't need all the computing power your system has, you are wasting resources. Virtual machines to the rescue! Using virtual machines, you can get more out of your resources. If you need more computing power, you create copies of the VM running your application. If you need less, you can scale down your system easily. And… What do you do with the spare resources? You rent them to other companies with the same problems (oh wait! We just invented Cloud Computing).
This new scenario has other benefits: since you can create and destroy instances of your applications easily, if one of them starts to fail, you can remove it and spawn a new one. Another benefit: if you are running microservices, you can scale one of them up or down without affecting the others.
Using a cloud as infrastructure for your virtual machines avoids the need to maintain physical hardware. And, depending on the cloud you are using, you can deploy your application in different regions, reducing response latency for your customers.
When we used physical resources, we could run one instance of our application per machine. Now, we can run multiple instances per machine. But our infrastructure can be even more efficient: for each VM, there is a full operating system installation, libraries and other stuff that is not really important for our application. How about a way to isolate the instances of our applications while sharing those resources? That's the idea behind containers.
With containers, we don't need a hypervisor. Instead, we run a container engine on the host operating system and share OS resources across the different containers. As a consequence, each container only includes the application and its dependencies. This leads to lower resource usage per application instance, so we can run more instances of our application on the same machine.
The fewer resources we use, the more money we save.
Why is this related to observability?
With a single instance of our application, observability was important to reduce the time needed to identify possible issues in our system. Imagine how important it is now that we have hundreds of containers running instances of our microservices in different clouds, with thousands of users accessing them. Systems are complex. And complexity increases the probability of a catastrophe.
There is something you need to take into account: disasters cannot be avoided. There will always be:
1) Human error
2) Network issues: latency, bandwidth, reliability…
3) Outages
4) Untested code paths
5) Corner cases
We need to be ready for issues so we can mitigate them or, at least, recover from them as fast as possible. Every minute our system is down is money our company loses.
The three pillars of observability
If your application is observable, you should be able to detect where the problem originated by checking the outputs. What outputs?
1) Traces: contextual data about how a request or operation went through a system
2) Metrics: quantitative information about the system
3) Logs: specific messages emitted by a process or service
As I said, once your system is generating that data, you need to convert it into information. Usually, you send everything to a central place where you can correlate it all. For instance: if you start seeing crashes in your application, you can query your metrics to detect which requests are producing those crashes. After that, using tracing and logging, you can go directly to the source of the problem. Observability is about integrating multiple sources of telemetry data to help you better understand how your software is operating: getting the context to solve a problem before it happens or as soon as possible after the failure occurs.
Something that is pretty clear is that monitoring is complementary to observability. Monitoring tells you when something is wrong; observability tells you why. Monitoring provides information about how the system is performing (connectivity, downtime, bottlenecks); observability tells you which piece is failing and the reasons. Monitoring is part of observability.
Traces
A trace is a visualization of the events in your system that shows the calling relationships between them. This visualization includes information like timing data for each event. The individual events are called "spans".
Using them, you can see which functions were called in your code or when one microservice connected to a database to retrieve some information. And, really importantly, you can see how much time is spent in the different operations. Distributed tracing lets you profile your system in production.
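As an example, here is a minimal sketch of producing spans with the OpenTelemetry Python SDK (assuming the `opentelemetry-sdk` package is installed; the span names and the simulated database call are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console instead of sending them to a real backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)

def handle_request(user_id: int) -> None:
    # Parent span: the whole operation.
    with tracer.start_as_current_span("handle-request"):
        # Child span: captures the timing of the (simulated) database call.
        with tracer.start_as_current_span("query-database"):
            pass  # imagine a SELECT for user_id here

handle_request(42)
```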
Metrics
A metric is a single numeric value tracked over time. We have used metrics in monitoring systems for years: CPU usage, memory, disk performance… But this information alone gives us an incomplete view of what is happening in our system.
Observable systems embrace the concept of "application performance monitoring", which proposes tracking application-level metrics like average load times, error rates, requests per second, or the time spent on operations against third-party services, providing a real picture of what the user experiences.
Each metric tracks only one variable, which makes them more efficient to send and store.
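As an illustration, here is a sketch of tracking this kind of application-level metric with the `prometheus_client` Python library; the metric names and the simulated workload are made up:

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative application-level metrics: request volume, errors, latency.
REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_errors_total", "Requests that ended in an error")
LATENCY = Histogram("app_request_latency_seconds", "Time spent per request")

def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():  # records how long this block takes
        time.sleep(random.uniform(0.01, 0.05))  # simulated work
        if random.random() < 0.05:              # simulated 5% error rate
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request()
```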
Logs
Logs are text strings written to the terminal or to a file. You can generate these logs directly by printing information to the screen. But there are libraries that can help you structure that data and print information depending on its importance level (a debug message is not the same as an error message, for instance).
Log aggregation services allow you to send these messages to a central place and store them. Later, you can retrieve that data and extract information about what is happening in your system.
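For example, Python's standard `logging` library attaches a severity level to each message, so an aggregation service can later filter debug noise from real errors (the logger name and messages below are invented):

```python
import logging

# Each message carries a severity level; at INFO level, debug messages are dropped.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("payments")

logger.debug("cart contents: %s", ["book", "pen"])  # filtered out at INFO level
logger.info("payment started for order %d", 1234)
logger.error("payment gateway returned HTTP 503")
```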
Closing
I hope you found this article useful. I'll create more posts about more observability topics soon!