With request-level isolation, multiple teams can experiment on a single shared cluster.
Originally posted on The New Stack by Nočnica Mellifera.
What’s the best way to know that your code will work? A fascinating trend, as I speak to capable platform engineers and operations architects, is that no one can quite agree on where, or how, testing is supposed to happen.
When do you first get an indication that your code isn’t working correctly with other services? Should staging fail often as developers test big changes, or should staging only ever receive working code commits? Should huge efforts go into contract testing, with complex mocks to simulate things like latency spikes, or should we just put up a canary test on production and see how things go? None of these questions has a unanimous answer on enterprise platform engineering teams.
Let’s look at the problems with attempting local replication of a complex microservice environment. Smaller teams can absolutely give every engineer a copy of their production cluster that runs on a laptop, but this approach does not scale. Past a certain point, the time spent on local replication is better spent creating a staging environment that the whole team can share and safely use for testing from the first day of development.
The Two-Dozen Microservice Dilemma
Every question about the best development platform begins with scale. For our case, imagine a team with more than 50 engineers and more than 25 microservices. A team around this size sits at an inflection point: processes that would be familiar from a monolithic application must give way to those of a more distributed, shared, high-velocity development team.
What’s true about a team of 50 engineers and 25 microservices? Let’s list a few observations:
- The teams are too large to stay in sync and share knowledge: Team C may be updating the database interface without anyone on Team A knowing the work is happening.
- The compute work done by all microservices is enough to tax a normal laptop.
- More than one database is in use.
- Code is spread across multiple repositories.
When the team and product were half this size, a developer could grab the necessary repos, get help from other teams to get things working, and, when their replica got out of date, they likely already knew from updates shared by other teams. At this scale, however, those casual human communications break down, and someone on Team A will find their local replication environment has drifted out of sync without realizing it.
Sunk Costs: Over-Committing to Local Replicas
In this situation, many teams decide to buy into local replication in earnest: they start committing real DevOps resources to the project. Suddenly we’re responsible for maintaining a Dockerfile for local replication, which developers must keep updated to know whether their changes will work with the other services.
The reasons for committing to this work seem persuasive: With a consistent local replica, the devs find bugs before updates go to staging and don’t block the work of other teams that need staging to be available most of the time. (I use staging throughout here, but just think of it as the deployment before production, whether it’s called staging, QA, testing or something else.) However, there are three major concerns about investing serious time in local replication at this stage:
- If you’re not currently running local replicas of your whole cluster, it’s likely that the architecture itself will need to be reworked, with a standard way of starting and running services, a monorepo architecture and clear service ownership.
- Many components can’t be replicated well locally, including third-party services and data stores with complex data structures inside. The result will be mocks of these components or other highly simplified copies, which raises concerns about the accuracy of testing and the cost of ongoing maintenance.
- This approach won’t scale long term. Once the team and the architecture both double in size, there will be no way to run the whole thing on a developer laptop. And once a laptop can’t run the cluster, running a cloud replica of it for every developer carries prohibitive infrastructure costs.
This doesn’t mean local replication won’t work for any team; it means that once your scale requires full-time maintenance of the local replica image, that time is better spent elsewhere.
Why Do All Your Microservices Come as a Bundle?
This whole discussion raises another question: If you need to test every code change against the rest of the cluster, do you really have microservices? Even if 25 components of your product run as separate services, if they’re so tightly coupled that they can’t be tested in isolation, don’t you have microservices in name only? (As a side note, I desperately hope the acronym MINO for tightly coupled microservice architectures catches on.)
Every discussion of testing integration between microservices comes back to the question of how cleanly the microservices should be separated so that you can do contract testing. The issue, again, comes back to scale.
At small scales, every service can reasonably be expected to meet its contract with other services perfectly. Even at a large scale, there shouldn’t be unexpected side effects of transactions within your cluster. At larger scales, however, the contract-testing requirements grow more and more complex. Contract testing doesn’t cover latency, multivariate requests or unexpected data within data stores: all cases we’d like testing to cover before we’re preparing to head to production.
Is it possible to cover these cases? Certainly, but the question is whether we should be spending large amounts of time simulating all the other services within a cluster, or if that time would be better spent establishing a single, high-accuracy clone of production in a staging server.
The overall return on investment for re-architecting our product to more cleanly separate microservices and implement extensive contract testing between those services might not justify such a large technical spike.
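To make the scope of a typical contract test concrete, here is a minimal, hypothetical consumer-side check; the service name, URL and fields are invented for illustration. Note how it pins down the shape of a response but says nothing about latency spikes, concurrent requests or the state of the data store behind the service.

```python
# Hypothetical consumer-side contract test for an "orders" service.
# It verifies the response shape the consumer relies on, but it cannot
# exercise latency spikes, multivariate requests or unexpected data
# sitting in the data store behind the service.
import requests

ORDERS_URL = "http://orders.staging.internal/orders/42"  # illustrative endpoint

def test_order_contract():
    resp = requests.get(ORDERS_URL, timeout=5)
    assert resp.status_code == 200

    body = resp.json()
    # The "contract": fields the consumer depends on exist and have the right types.
    assert isinstance(body["id"], int)
    assert isinstance(body["status"], str)
    assert isinstance(body["line_items"], list)
```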
A Better Solution: A Shared Cluster as a Source of Truth
If we don’t want to invest the time in fitting our cluster onto a workstation or building a deep set of contract tests, the solution is a shared cluster that very closely resembles production. This staging environment offers real answers about whether changes will play well with the other services, and it’s a single cluster to update when other services change. Finally, unlike a local environment, it should be available 24/7 with no need for developers to update their own replication environment.
Again, we must discuss problems of scale. At 50 developers and 25 microservices, multiple teams will likely want to test on staging at the same time. This scale is the inflection point here, again: a bit smaller and teams could just post on Slack that they’re using staging for the next few hours. But as we grow, it becomes harder to stay synchronized, and we can end up with developers waiting hours or days for staging to become available.
Using a Kubernetes namespace as a team’s development environment offers a robust solution for replicating the conditions of a staging or production environment. By creating a namespace that is a clone of the staging setup, developers can work in an environment that closely mimics production. This approach ensures that all services, configurations and dependencies are aligned, making it easier to catch issues early in the development cycle.
A cloned namespace also facilitates better collaboration among team members. Since the namespace is isolated, multiple developers can work on different features or bug fixes without stepping on each other’s toes. This isolation is particularly beneficial for DevOps engineers who need to manage complex CI/CD pipelines, as it allows them to test deployment scripts and orchestration procedures in an environment that is almost identical to production. The namespace can act as a final checkpoint where all code and features are integrated and tested before being moved to staging or production. This approach is in use at teams like Prezi, where each development team has a namespace to deploy and test changes.
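As a rough sketch of what “cloning” a staging setup can look like in practice, the script below creates a per-team namespace and applies the same manifests that define staging into it. The namespace name, manifest directory and use of kubectl via subprocess are assumptions for illustration; a real tool would also handle secrets, cluster-scoped resources and image tags.

```python
# Minimal sketch: create a per-team namespace and apply the manifests that
# define staging into it. Assumes namespace-agnostic YAML manifests in
# ./manifests and a kubectl context pointing at the shared cluster.
import subprocess

def clone_staging(namespace: str, manifest_dir: str = "./manifests") -> None:
    # Create the team namespace; ignore the error if it already exists.
    subprocess.run(["kubectl", "create", "namespace", namespace], check=False)

    # Apply the staging manifests into the team's namespace.
    subprocess.run(
        ["kubectl", "apply", "-n", namespace, "-f", manifest_dir, "--recursive"],
        check=True,
    )

if __name__ == "__main__":
    clone_staging("team-a-dev")
```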
Concerns with Namespace Replication
It’s essential to manage these cloned namespaces carefully to avoid configuration drift. Automated tools and scripts are required to ensure that the namespace remains a true replica of the staging or production environment. Any changes to the staging or production setups need to be mirrored in the development namespace as soon as possible.
If this tight integration and synchronization isn’t maintained, the result will be namespaces that have fallen out of date and are no longer trustworthy environments for developers to use.
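One way to keep a cloned namespace honest is a scheduled drift check. The sketch below assumes a baseline namespace called staging and a clone called team-a-dev; it compares the container images running in each and reports mismatches. A production-grade tool would also compare configuration, resource settings and third-party integrations.

```python
# Minimal drift check: compare the container images deployed in the baseline
# namespace with those in a cloned dev namespace and report any mismatch.
import json
import subprocess

def images(namespace: str) -> dict:
    """Map each deployment in a namespace to the list of images it runs."""
    out = subprocess.run(
        ["kubectl", "get", "deployments", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {
        d["metadata"]["name"]: [
            c["image"] for c in d["spec"]["template"]["spec"]["containers"]
        ]
        for d in json.loads(out)["items"]
    }

def report_drift(baseline: str = "staging", clone: str = "team-a-dev") -> None:
    base, dev = images(baseline), images(clone)
    for name, base_images in base.items():
        if dev.get(name) != base_images:
            print(f"DRIFT in {name}: clone runs {dev.get(name)}, baseline runs {base_images}")

if __name__ == "__main__":
    report_drift()
```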
As your team scales up, you’ll need more namespaces with replicas of the pertinent parts of production. Since each namespace needs its own replica of databases, cloud resources and third-party integrations, this can start to feel daunting.
The last consideration is the cost, in infrastructure or in time, of running all these replicated namespaces. Either you’re running many namespaces all the time, which is costly, or you’re starting each namespace’s services every time the team wants to run integration tests, adding friction to testing and experimentation. This overhead for platform engineering teams brings us back to the general point that environment replication doesn’t scale for large microservice teams.
At Uber and Lyft, the engineering teams found the namespace approach insufficient due to synchronization and testing fidelity issues, and they moved on to a request isolation model where multiple teams could safely experiment on a single shared cluster.
Why Environment Replication Doesn’t Scale
The allure of local replication, while initially promising, reveals its limitations as teams and architectures scale. It’s not just a matter of finding bugs early; it’s about the accuracy of those tests and the sustainability of the testing environment. Contract testing, while valuable, also shows its limitations as the complexity of interactions between services increases.
When considering these roadblocks to integration testing and dev environments at scale with microservices, I would encourage you to reconsider what we mean by “microservices.” If services are so interdependent that they can’t be tested in isolation, the term becomes more of a label than a description of the architecture.
The shared staging environment emerges as a pragmatic middle ground. Using Kubernetes namespaces for team-specific environments can offer a balance between isolation and accuracy. However, even this approach is not without its pitfalls, such as the risk of configuration drift and the operational overhead involved.
As we scale, our testing methods must scale with us, always aiming for that elusive combination of accuracy, efficiency and maintainability. In recent years a new approach has come to the fore, using a shared environment without multiple replicas and isolating experiments via request isolation.
Request Isolation: A New Model for Developer Experimentation and Testing
On large enterprise teams, and increasingly on mid-sized development teams, a new model is emerging that promises faster, better testing earlier in the development cycle.
Request-level isolation is an approach to testing in a microservices environment that leverages context propagation and request routing. When a developer wants to test a new version of a microservice, dependencies are satisfied from a shared pool of services running the latest stable version, known as the baseline. This method ensures that changes made by one developer are isolated from another, mitigating issues of cross-dependency and unpredictable staging environments.
With a shared staging environment, we can produce a highly accurate replication space, as in the namespace strategy mentioned above. But rather than replicating components into a namespace, we can use request isolation to run multiple developer versions of services simultaneously.
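To give a feel for the mechanics, here is a small sketch of the routing decision at the heart of request-level isolation. The header name, service URLs and sandbox identifier are invented for illustration; in practice this logic usually lives in a service mesh rule, sidecar or gateway rather than in application code.

```python
# Sketch of header-based routing for request-level isolation. Requests that
# carry a propagated sandbox id are sent to the developer's version of a
# service; all other requests fall through to the shared baseline.
ROUTING_HEADER = "x-sandbox-id"  # illustrative header name

BASELINE_URL = "http://orders-baseline.staging.svc"  # shared stable version
SANDBOX_URLS = {
    "alice-payment-fix": "http://orders-alice.staging.svc",  # Alice's test version
}

def resolve_upstream(headers: dict) -> str:
    """Pick the upstream for this request based on the propagated sandbox id."""
    return SANDBOX_URLS.get(headers.get(ROUTING_HEADER), BASELINE_URL)

def propagate(headers: dict) -> dict:
    """Carry the sandbox id forward so every downstream hop makes the same choice."""
    return {ROUTING_HEADER: headers[ROUTING_HEADER]} if ROUTING_HEADER in headers else {}

# Tagged requests reach the sandboxed service; untagged traffic stays on baseline.
assert resolve_upstream({ROUTING_HEADER: "alice-payment-fix"}) == "http://orders-alice.staging.svc"
assert resolve_upstream({}) == BASELINE_URL
```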