A recent outage involving CrowdStrike affected roughly 8.5 million Windows devices, disrupting global services including airlines and hospitals. Multiple analyses have examined the root cause of the incident itself.
However, as a software engineer, I think we are missing the aspect of human emotions related to deployments, specifically the fear of breaking production. That’s what we will try to dive into in this article. We will cover:
- Understanding the function of release engineering.
- What software engineers care about and what they don’t.
- Impact of continuous deployment (CD).
- A look at manual deployments.
- Problems with manual deployment and the solution to these problems.
Release Engineering
Before delving into the fear of deployments from a software engineer’s perspective, let’s first understand the role of a release engineer.
Release engineering has evolved considerably in recent years, thanks to modern CI and CD tools and the standardization around Kubernetes. Despite these advancements, the primary responsibilities remain the same:
Consistent and repeatable deployments: Standardizing release processes reduces the risk of bad deployments to production.
Reducing service disruptions: Standardized processes also ensure teams are equipped to handle production incidents, for example with a rollback strategy for cases where a release causes problems.
Monitor and optimize performance: Look for performance improvements that make deployments faster and more reliable.
Collaborate with engineering: Work closely with developers, QA, and DevOps teams to ensure all new and existing services have a well-defined deployment process.
What Software Engineers Care About
Unlike release engineers, we as software engineers working in a product team may only care about certain aspects of deployments:
Quick code merges: Merging quickly allows us to validate our work and move on to new tasks or unblock dependent tasks.
Production incidents: Although engineers may not care about all production incidents, they definitely care about their code changes causing any production outages.
Deployment schedule: Engineers also like to track when their changes go live or have gone live, so that they can have access to real-time feedback on their changes.
What Software Engineers Don’t Care About
Although there are things we care about, there are also those we don’t:
Deployment methodology: Although we recognize the need for an efficient and reliable deployment process, we don’t care how it is performed.
Effect of other changes: Unless things go wrong, we don’t worry about unrelated changes from other developers.
Deployment management: We are indifferent to who manages deployments on the team. We would only care about managing deployments if we were tasked with doing so.
Impact of Continuous Deployments (CD)
So what does this fear have to do with continuous deployment?
A lot.
Studies have reported several benefits of Continuous Deployment (CD), and unsurprisingly many of them are psychological in nature. Continuous deployment removes the “human-in-the-loop”, so it requires strong trust in the test infrastructure.
In other words, automated tests not only ensure the reliability of production but also provide psychological safety, sometimes irrationally, reducing the fear of deployments. As a developer, I’m more comfortable making changes in a CD process than when I’m asked to verify the changes manually.
However, despite the popularity of these CD strategies, a lot of companies still trigger deployments manually (have a human-in-the-loop), indicating a cautious approach to CD implementations. This behavior suggests that teams prefer to retain supervision over the release process and intervene where necessary.
This is important to understand from a psychological safety perspective. Manual deployments imply that someone is overseeing the process and handling issues when things go wrong. While this provides a sense of security, it can also induce fear in the person deploying and is prone to human error.
Manual deployments
Despite the drawbacks, many teams still manage deployments manually. A typical manual deployment setup may include some of the following:
Supervision
Someone babysits the entire deployment process as a release goes out. This person is tasked with intervening when and if there are signs of trouble. Teams maintain an on-call person who manages their deployments and handles problems when they arise.
Dedicated Release Teams
Some teams have a dedicated release engineering team, which ensures releases go smoothly. Since this means a high degree of specialization, the deployment process could be more efficient and reliable.
Spreadsheets
Some companies maintain a spreadsheet to validate any changes made. This allows companies to systematically review and approve these changes, ensuring they meet predefined quality standards.
Manual QA
In addition to spreadsheets, manual QA is another layer companies add. Manual QA tests new releases in staging environments before deploying them to production. However, a staging environment isn’t foolproof, and some real-life scenarios won’t be accounted for.
Where Do Things Go Wrong With Manual Deployments?
Many things can go wrong for any software development team relying solely on manual deployments:
Dependence on a small group
Relying on a few people creates bottlenecks, which lead to release delays and, in some instances, human error. The team is also exposed when one of those people leaves or cannot deliver on the required tasks.
No risk-mitigation strategy
There is no defined plan to follow when a production incident occurs. When one happens, the release team has to scramble to find the relevant stakeholders to help resolve it and make decisions.
Prone to human error
Typographical errors in commands or scripts, or forgetting to run the pre-deployment or post-deployment steps, are common.
High effort
Since deployments require babysitting, the process becomes time-consuming, and the frequency of deployments drops significantly. For instance, if it takes an hour to monitor the entire deployment, the release team may decide to skip deployments on days with minor changes to save that time.
Communication Breakdown
Product teams are left unclear about the state of releases and when their changes will reach production.
Looking at these challenges, it’s easy to understand why engineers dread deployments. The risk of deployment failures, the high stakes, and the pressure to keep downtime low also contribute to this fear.
These failures can be minimized by increasing test automation. Still, since these tests are carried out in a test environment, you should not expect an automated test to catch every possible error. Failures are to be expected but at a reduced rate.
What can we do about it?
Simply set up Continuous Deployments? Easier said than done. Despite the drawbacks, manual deployments are still okay if managed well. The goals should be:
- provide guardrails to avoid production incidents
- reduce human errors
- enable anyone to trigger deploys
- ensure deployments happen frequently
Guardrails – Canary and Rollbacks
Canary and rollback strategies can help reduce the impact of an outage and, in many cases, avert the crisis automatically.
A canary release exposes your new release to a small portion of production traffic. This gives teams insight into issues that might not have come up during testing.
A rollback strategy, on the other hand, helps engineers revert a release to its previous stable version. It is used when new problems arise after a deployment to the production environment.
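As a rough illustration of how a canary and an automatic rollback decision can fit together, here is a minimal Python sketch. The names `routes_to_canary` and `canary_decision`, the 5% traffic share, the 2% error threshold, and the 100-request minimum are all assumptions made for this example, not values from any particular tool.

```python
import hashlib

CANARY_PERCENT = 5       # assumed share of traffic sent to the new release
MAX_ERROR_RATE = 0.02    # assumed error rate above which we roll back
MIN_REQUESTS = 100       # assumed minimum sample before making a decision


def routes_to_canary(user_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically place a user in the canary group by hashing the id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def canary_decision(requests: int, errors: int,
                    max_error_rate: float = MAX_ERROR_RATE) -> str:
    """Return 'wait', 'rollback', or 'promote' from observed canary traffic."""
    if requests < MIN_REQUESTS:
        return "wait"        # too little traffic to judge, keep observing
    if errors / requests > max_error_rate:
        return "rollback"    # canary is unhealthy, revert to the stable version
    return "promote"         # canary looks healthy, roll out to everyone
```

Hashing the user id keeps each user on the same version across requests, which tends to make canary issues easier to reproduce. A real setup would usually make this decision in the load balancer or deployment tool rather than in application code.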
Reduce human errors – Standardization
Define standard deployment methodologies that result in efficiency, consistency, reliability, and high software quality. In its State of DevOps report, DORA shows that reliability predicts better operational performance. Furthermore, a standardized process makes releases repeatable, and repeatable steps can be automated. Automating the process helps a team keep production costs lower.
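One way to picture standardization is to encode the deployment as an ordered list of named steps, so nobody has to remember the pre-deployment or post-deployment checks. The sketch below is a simplified illustration; `run_deployment` and the step names are hypothetical, and each step is just a function that returns whether it succeeded.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_deployment(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, and return a log."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break    # later steps never run after a failure
    return log


# Hypothetical steps; real ones would call your own scripts or tools.
standard_steps: List[Step] = [
    ("pre-deploy checks", lambda: True),
    ("deploy", lambda: True),
    ("post-deploy smoke test", lambda: True),
]

for line in run_deployment(standard_steps):
    print(line)
```

Because every release goes through the same list, a skipped step becomes a code change that can be reviewed instead of a silent omission.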
Democratize deployment process
Democratizing the deployment process removes the reliance on specific individuals. If we empower any software engineer to deploy, the fear slowly fades: “If anyone can deploy, it should not be too hard.” Share your legos!
Frequent deployments
To reduce deployment anxiety, we need to deploy more frequently, not less. The DORA report also highlights that smaller batch deployments are less likely to cause issues and help lower the psychological barrier for developers.
Improve developer experience
Clarifying what is being deployed enhances the developer experience. Make it easy for developers to know when deployments occur and what changes are included. This transparency helps developers track when their changes go live and simplifies incident investigations.
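To make "what is being deployed" concrete, a deployment tool can list the commits that sit between the last deployed version and the release candidate. Here is a small sketch under the assumption that the commit history is available as a list ordered from oldest to newest; `changes_in_release` and the commit ids are made up for illustration.

```python
from typing import List


def changes_in_release(history: List[str], last_deployed: str,
                       candidate: str) -> List[str]:
    """Return commits after `last_deployed` up to and including `candidate`.

    `history` is assumed to be ordered oldest to newest. A commit that is
    missing from `history` raises ValueError.
    """
    start = history.index(last_deployed) + 1
    end = history.index(candidate) + 1
    return history[start:end]


history = ["a1", "b2", "c3", "d4"]    # hypothetical commit ids
print(changes_in_release(history, last_deployed="a1", candidate="c3"))
```

Posting this list to the team channel on every deploy is one simple way to answer "has my change gone live yet?" without asking the release owner.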
Defined risk-mitigation strategies
There should be defined steps to follow for rollbacks and hotfixes, as this helps eliminate indecision during production incidents. For instance, keeping the build and deploy steps separate makes rollbacks easy, because the previous build can be redeployed without rebuilding it.
Similarly, standardizing how to deal with hotfixes and cherrypicks can make it simple to operate when the stakes are high.
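The value of separating build from deploy is easier to see in code. In the sketch below, artifacts are assumed to be built once and identified by a tag such as `app:1.0`; deploying only records which tag is live, so a rollback is just pointing back at the previous tag. `ReleaseHistory` and the tags are hypothetical names for this example.

```python
from typing import Dict, List, Optional


class ReleaseHistory:
    """Tracks which already-built artifact is live in each environment."""

    def __init__(self) -> None:
        self._deployed: Dict[str, List[str]] = {}

    def deploy(self, env: str, artifact: str) -> None:
        """Record `artifact` as the live version in `env`."""
        self._deployed.setdefault(env, []).append(artifact)

    def current(self, env: str) -> Optional[str]:
        """Return the live artifact in `env`, or None if nothing is deployed."""
        versions = self._deployed.get(env, [])
        return versions[-1] if versions else None

    def rollback(self, env: str) -> str:
        """Drop the live version and return the previous one, with no rebuild."""
        versions = self._deployed.get(env, [])
        if len(versions) < 2:
            raise RuntimeError(f"no previous release to roll back to in {env}")
        versions.pop()
        return versions[-1]


releases = ReleaseHistory()
releases.deploy("prod", "app:1.0")
releases.deploy("prod", "app:1.1")
print(releases.rollback("prod"))
```

Since no build happens during the rollback, there is one less thing that can fail while the incident is ongoing.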
Feature flags
Feature flags are like kill-switches that can turn off a new feature that caused an incident in production. This can enable engineers to resolve production incidents quickly.
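A kill switch can be as simple as a lookup that guards the new code path. The following sketch uses an in-memory store for illustration; `FeatureFlags`, `checkout_total`, and the `new_discount` flag are hypothetical, and a real system would read flags from a shared service so they can be flipped without a deploy.

```python
from typing import Dict, List


class FeatureFlags:
    """In-memory flag store, standing in for a shared flag service."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)


def checkout_total(prices: List[float], flags: FeatureFlags) -> float:
    """Sum the prices, applying the new discount only when its flag is on."""
    total = sum(prices)
    if flags.is_enabled("new_discount"):    # hypothetical new feature
        total *= 0.9
    return round(total, 2)


flags = FeatureFlags()
flags.set("new_discount", True)     # feature goes live
flags.set("new_discount", False)    # incident: switch it off, no redeploy
```

Defaulting unknown flags to off is a deliberate choice here: code for an unfinished feature can ship safely and only becomes active when someone turns it on.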
Conclusion
Software teams must treat release engineering as a priority from the outset of product development to avoid costly mistakes. And we should not let incidents like the CrowdStrike outage cripple our development practices. Addressing the fear of deployment and preventing production incidents involves several key strategies:
- Invest in the standardization of deployment processes.
- Set up well-defined risk-mitigating strategies, such as canary releases, strategic rollouts, rollbacks, and hotfixes.
- Simplify the developer experience by democratizing deployments, and encourage everyone to participate.