Context:
I am an adamant advocate of designing services in a way that promotes testing them in production as part of the software development cycle. Developers have their local test environments, and we have ephemeral environments for running integration tests, but we do not have staging environments. This comes as a surprise – almost a crazy idea – to people who have not worked in such a setup. What follows is an internal memo that expresses my views in a conversation with the team about having a dedicated test/staging environment.
Internal memo:
There is no denying that pre-production testing is valuable – you can make risk-free deployments, revert them, etc.
In an ideal world, every branch being worked on would have its own web app, database, API, Temporal, etc. – an entire copy of production. This environment would last no longer than the branch (but would also not provide any guarantees about persisting state within that period). This is the ideal ephemeral testing environment. It is ideal because:
- we know that it is reasonably similar to production because initialization is versioned and replicable
- we can always ensure that it matches the current production environment by recreating it
- it carries no maintenance cost because there is no expectation of persistence
We already have components of it (ephemeral API and database instances), and one day I would like us to have all of the above.
In contrast, a persistent staging environment does not provide any of these guarantees:
- You have no clue what is deployed to this environment because multiple people can be contributing to it, i.e., you are not testing your changes in isolation.
- You have no clue what state the environment is in because multiple users and automated tests can be making changes to it.
- A staging environment is more than likely to lack robust monitoring, and alerts coming from staging are not going to be prioritized.
Some testing is not even possible in a staging environment, namely performance and stress testing, because staging is deployed on different infrastructure.
As far as integration tests go, a shared test environment in an unpredictable state is about as useful as a PR self-attesting that it has no bugs: what works there is not guaranteed to work in production.
On the other hand, testing in production is hard:
- You need to be careful when deploying changes because no matter how many safeguards you add to CI/CD, there is a chance of breaking production.
- You have to think ahead not only about whether your changes will work in the test environment, but also whether they will work in production (a perfect example is forgetting to create indexes concurrently, which will work in staging but may completely lock production).
- You have to make expensive transactions conditional, e.g., you should be able to switch between sandbox and production Stripe credentials both in production and in your local development environment. Otherwise, you cannot easily debug issues that arise in only one of the environments.
- You have to plan for tests to be atomic, i.e., ensure that other users' actions do not affect your tests.
- You have to have a process for canary and blue/green deployments, and for rollbacks.
- You have to have a process for isolating and deleting test data.
- You have to have sufficient redundancy to avoid outages.
- You have to communicate changes being deployed.
- You have to monitor your deployments to isolate or revert deployments with unexpected behavior.
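To make the conditional-credentials point above concrete, here is a minimal sketch. The environment variable names and the helper are hypothetical, not a real Stripe convention; the idea is that the sandbox/live choice is an explicit flag rather than being hard-wired to where the code runs, so you can exercise the sandbox integration from production and debug either mode locally.

```python
import os

def stripe_api_key(use_sandbox: bool) -> str:
    """Return the Stripe secret key for the requested mode.

    Keys are read from the environment; STRIPE_SANDBOX_KEY and
    STRIPE_LIVE_KEY are assumed names for this sketch.
    """
    var = "STRIPE_SANDBOX_KEY" if use_sandbox else "STRIPE_LIVE_KEY"
    key = os.environ.get(var)
    if key is None:
        raise RuntimeError(f"{var} is not configured")
    return key
```

Because the flag travels with the request (or test run) rather than the deployment, the same code path is exercised in every environment.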
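The concurrent-index example deserves a sketch as well. In PostgreSQL, a plain CREATE INDEX holds a lock that blocks writes for the duration of the build – harmless on a small staging table, potentially an outage on a large production one – while CREATE INDEX CONCURRENTLY avoids the write lock at the cost of a slower build and not being runnable inside a transaction. The helper below only renders the statement; the table and index naming scheme is illustrative.

```python
def create_index_sql(table: str, column: str, concurrently: bool = True) -> str:
    """Render a PostgreSQL index-creation statement.

    Defaults to CONCURRENTLY so that a migration author has to opt
    *out* of the production-safe form, not opt in.
    """
    modifier = "CONCURRENTLY " if concurrently else ""
    return (
        f"CREATE INDEX {modifier}IF NOT EXISTS "
        f"idx_{table}_{column} ON {table} ({column})"
    )
```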
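Isolating and deleting test data can be as simple as tagging every entity a test run creates. This is a hedged sketch, not our actual schema: the marker format and field names are hypothetical, but the pattern – tag on creation, filter or purge by run – is what makes cleanup and analytics exclusion mechanical.

```python
# Marker format for this sketch; any unambiguous convention works.
TEST_RUN_PREFIX = "testrun:"

def tag_test_entity(entity: dict, run_id: str) -> dict:
    """Return a copy of the entity marked as belonging to a test run."""
    return {**entity, "owner": f"{TEST_RUN_PREFIX}{run_id}"}

def purge_test_run(entities: list[dict], run_id: str) -> list[dict]:
    """Drop every entity created by the given test run."""
    marker = f"{TEST_RUN_PREFIX}{run_id}"
    return [e for e in entities if e.get("owner") != marker]
```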
This is a lot of overhead, but all of these are also great practices for production maintenance regardless of where testing takes place.
In addition, testing in production allows a quicker path to (iteratively) test and collect feedback from real users, which is something we should all be aiming to do more of.
The way I see it, testing in production has higher upfront and ongoing costs, but it brings a heap of benefits that would otherwise get deprioritized in a normal software development cycle. It is how we learn to make safe deployments and quickly restore services when the unexpected happens.