The Journey to Multi-Region Infrastructure: Understanding Availability and Business Needs

Victor Martinez - Aug 29 - Dev Community

When I decided to write about implementing a multi-region strategy, I quickly realized that this topic is far too complex to cover in a single blog post. Therefore, I've started a series of posts explaining the entire process for companies to achieve a successful infrastructure project with the highest possible availability in the cloud.


To begin this extensive process, we need to start from the business perspective and understand why we need greater availability. It is crucial to recognize that multi-region implementation is not a business or technological goal in itself. Going multi-region increases the complexity of the platform and its operation and, consequently, its cost, all in service of a higher need: availability.

Let's shift the conversation to availability. Before diving into multi-region implementation, I recommend following this flow:

Measuring availability involves various metrics that give us a 360-degree view of the organization's state. However, these metrics can be complex to measure and even more challenging to understand within an organization. An organization should focus on mastering two key metrics: Mean Time to Recovery (MTTR) and Service Level Agreement (SLA).

When we talk about the availability percentage, we're referring to two things: the SLA offered to our customers in the contract and the availability percentage we measure on our platform. These two values are intimately connected. To define SLAs within the company's offering, we must first separate the organization's domains or products. It's almost impossible to describe all systems with a single value. Depending on the company's complexity, they can be divided geographically or by product (system).

For example, suppose product A has 99% availability and product B has 96%. If we offer our customers the average, 97.5%, they will expect at most 2.5% downtime. This becomes a problem we'll have to justify when an incident brings availability down to 96%. We can't simply respond to a customer with something like, "It's the average of all services." Additionally, stating availability per domain or product is not new and can be observed in both the value propositions and the status pages of major cloud providers:

AWS Lambda SLA
Microsoft Online Services SLAs

Distributing SLAs is normal practice for companies with clearly defined products, but it can be challenging in some cases when the company has yet to establish a product portfolio.
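
To make the averaging problem from the example concrete, here is a minimal Python sketch that turns each availability percentage into a monthly downtime budget. The 30-day month and the helper name `allowed_downtime_minutes` are my own assumptions for illustration; real SLAs define their own measurement periods.

```python
# Assumption: a 30-day billing month; actual SLAs define their own period.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime budget implied by an availability percentage over a period."""
    return period_minutes * (100.0 - availability_pct) / 100.0


# Figures from the example above: product A at 99%, product B at 96%.
product_availability = {"A": 99.0, "B": 96.0}

# The "blended" value we should avoid promising to customers.
blended = sum(product_availability.values()) / len(product_availability)

for name, pct in product_availability.items():
    print(f"Product {name}: {pct}% -> {allowed_downtime_minutes(pct):.0f} min/month")
print(f"Blended: {blended}% -> {allowed_downtime_minutes(blended):.0f} min/month")
```

Under these assumptions, the blended 97.5% promises a customer of product B about 1,080 minutes of downtime per month, while product B on its own allows about 1,728. That gap is exactly what we would have to justify.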

Mean Time to Recovery (MTTR) is the other crucial point. It sets our commitment for restoring service in extreme cases, such as natural disasters or significant physical or cloud infrastructure failures. In these scenarios, we're talking about major failures like a data center without power, an internet outage, failure of the entire information storage layer due to cloud provider issues, latency between provider services exceeding our response time, etc.

This time can be understood as the time needed to restore operational capabilities in the worst-case scenarios, covering everything from recovering data to waiting for DNS to propagate the new public IP address.
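
As a rough illustration of the two ideas above, the following Python sketch adds up the phases of a single worst-case recovery and averages the recovery times of past incidents. Every phase name and duration here is hypothetical; plug in numbers from your own drills and incident records.

```python
from datetime import timedelta

# Hypothetical phases of one worst-case recovery, executed one after another.
recovery_phases = {
    "restore_data": timedelta(minutes=90),
    "redeploy_services": timedelta(minutes=30),
    "dns_propagation": timedelta(minutes=5),
}


def worst_case_recovery(phases: dict[str, timedelta]) -> timedelta:
    """Total recovery time when the phases run sequentially."""
    return sum(phases.values(), timedelta())


# Hypothetical recovery durations observed in past incidents.
past_recoveries = [
    timedelta(minutes=40),
    timedelta(minutes=125),
    timedelta(minutes=75),
]


def mean_time_to_recovery(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of the observed recovery durations."""
    return sum(durations, timedelta()) / len(durations)


print("Worst-case recovery:", worst_case_recovery(recovery_phases))
print("MTTR:", mean_time_to_recovery(past_recoveries))
```

Keeping the phases separate makes it easier to see which one dominates the total, which is usually where an architectural change pays off first.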
Once these two points are established, we can discuss high availability schemes and potential architectural challenges in each scenario. But I'll leave that for the next blog post on architectural changes.

In conclusion, before jumping into multi-region implementation, it's crucial to understand your business needs, define clear SLAs for each product or domain, and establish realistic MTTR goals. These foundational steps will guide your journey towards a more resilient and available infrastructure.

Stay tuned for the next post in this series, where we'll discuss the architectural changes necessary to implement a multi-region strategy.
