DORA Metrics: What are they, and what's new in 2024?

Justin Reock - Jan 23 - Dev Community

Despite some recent criticism, DORA metrics remain the most frequently asked-about framework for measuring developer productivity. But how can its younger sibling, the SPACE framework, change the dialogue around engineering measurement, and what role do IDPs play in bridging the gap?

There is nothing more valuable to an organization than data—about customers, products, opportunities, gaps... the list goes on. We know that to maximize value streams for the business we need to turn a critical eye to data related to how each group operates, including software development teams. In 2019 a group known as the DevOps Research and Assessment (DORA) team set out to find a universally applicable framework for doing just that. After analyzing survey data from 31,000 software professionals worldwide collected over a period of six years, the DORA team identified four key metrics to help DevOps and engineering leaders better measure software delivery efficiency:

Velocity Metrics

Deployment frequency: Frequency of code deployed
Lead time for changes: Time from code commit to production
Stability Metrics

Mean time to recovery: Time to recover after an incident (now Failed Deployment Recovery Time)
Change failure rate: Percentage of changes that lead to failure

In 2021, DORA added a fifth metric to close a noted gap in measuring performance — reliability. The addition of this metric opened the door for increased collaboration between SREs and DevOps groups. Together these five metrics, now referred to simply as “DORA metrics,” have become the standard for gauging the efficacy of software development teams in organizations looking to modernize, as well as those looking to gain an edge over competitors.

In this post we’ll discuss what each metric can reveal about your team, how the benchmarks for “Elite” (back again in 2023 after being dropped in 2022), “High-Performing,” “Medium,” and “Low-Performing” teams have changed in the last year, and what all of this means in relation to the recently released SPACE framework—which puts more emphasis on process maturity than on output.

What is DORA?

First, let’s revisit what DORA (the institution behind the metrics) actually is. The DevOps Research and Assessment (DORA) team was founded in 2015 by Dr. Nicole Forsgren, Gene Kim, and Jez Humble with the charter of improving how organizations develop and deploy software. This group was also behind the inaugural State of DevOps report, and maintained ownership of the report until 2017. Their research resulted in what Humble has described as “a valid and reliable way to measure software delivery performance,” while also demonstrating that these metrics can “drive both commercial and non-commercial business outcomes.” In 2019 they joined Google, and in 2020 the first four of the familiar five DORA metrics were released, with the fifth and final following in 2021. An overview of each metric is below:

Lead Time for Changes

Lead Time for Changes (LTC) is the amount of time between a commit and production. LTC indicates how agile your team is—it not only tells you how long it takes to implement changes, but also how responsive your team is to the ever-evolving needs of end users. That’s why this is a critical metric for organizations hoping to stay ahead in an increasingly competitive landscape.

The DORA team first identified these benchmarks for performance in their Accelerate State of DevOps 2021 report, but have since updated them to the following (with original benchmarks noted in parentheses):

Elite Performers: <1 day (Original: <1 hour)
High Performers: 1 day to 1 week (Original: Same)
Medium Performers: Between 1 week and 1 month (Original: Between 1 month and 6 months)
Low Performers: Between 1 week and 1 month (Original: 6+ months)

LTC can reveal symptoms of poor DevOps practices: if it’s taking weeks or months to release code into production, you should assume there are inefficiencies in your processes. Fortunately, engineering teams can take several steps to minimize LTC:

  • Implement continuous integration and continuous delivery (CI/CD). Ensure testers and developers work closely together, so everyone has a comprehensive understanding of the software.
  • Consider building automated tests to save even more time and improve your CI/CD pipeline.
  • Define each step of your development process. Because there are a number of phases between the initiation and deployment of a change, it’s smart to define each step of your process and track how long each takes.
  • Examine your pull request cycle time. Gain a thorough picture of how your team is functioning and further insight into exactly where they can save time.
  • Be careful not to let the quality of your software delivery suffer in a quest for quicker changes. While a low LTC may indicate that your team is efficient, if they can’t support the changes they’re implementing, or if they’re moving at an unsustainable pace, you risk sacrificing the user experience. Rather than compare your team’s Lead Time for Changes to other teams’ or organizations’ LTC, you should evaluate this metric over time, and consider it as an indication of growth (or stagnancy).
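
To make the measurement itself concrete, here is a minimal sketch of how Lead Time for Changes could be computed from your own records, assuming you can pair each production deployment with the commit it contains; the data model and timestamps below are illustrative, not part of any official DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Change:
    commit_time: datetime   # when the change was committed
    deploy_time: datetime   # when that change reached production

def lead_time_for_changes(changes: list[Change]) -> timedelta:
    """Median time from commit to production across a set of shipped changes."""
    return median(c.deploy_time - c.commit_time for c in changes)

# Illustrative usage with made-up timestamps
sample = [
    Change(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 16, 30)),
    Change(datetime(2024, 1, 3, 11, 0), datetime(2024, 1, 5, 10, 0)),
]
print(lead_time_for_changes(sample))  # 1 day, 3:15:00 for this sample
```

Whether you summarize with the median, the mean, or a high percentile is a team choice; the point, as noted above, is to track the same statistic consistently over time rather than comparing against other teams.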

Deployment Frequency

Deployment Frequency (DF) measures how often you ship changes and how consistent your software delivery is. Tracking it enables your organization to better forecast delivery timelines for new features or enhancements to end-user favorites. According to the DORA team, these are the latest benchmarks for Deployment Frequency:

Elite Performers: On-demand or multiple deploys per day (Original: multiple per day)
High Performers: Once per day to once per week (Original: Once a week to once a month)
Medium Performers: Once per week to once per month (Original: Once a month to once every 6 months)
Low Performers: Once per week to once per month (Original: Less than once every 6 months)

While DORA has raised the bar on acceptable deployment frequency, starkly different numbers within and across teams can have a deeper meaning. Here are some common scenarios to watch for when investigating particularly high deploy counts:

Bottlenecks in the development process: Inconsistencies in coding and deployment processes can lead teams to adopt markedly different practices for breaking up their code.
Project complexity: If projects are overly complex, deploy frequency may be high but say little about the quality of the code shipped in each push.
Gamification: This metric may be easier to “game” than others, since it’s largely in the control of an individual developer, who may push code more often than normal if they believe their impact is measured by this metric alone.

That said, shipping many small changes usually isn’t a bad thing in and of itself. Shipping often can mean you are constantly refining your service, and if there is a problem with your code, it’s easier to find and remedy the issue. If your team is large, however, this may not be feasible; instead, consider building release trains and shipping at regular intervals. This approach allows you to deploy more often without overwhelming your team members.
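
As a rough, hedged illustration (not an official DORA calculation), deployment frequency can be derived from nothing more than a log of production deploy timestamps; the tier thresholds below are an approximate mapping onto the benchmarks listed earlier.

```python
from datetime import datetime, timedelta

def deploys_per_week(deploy_times: list[datetime]) -> float:
    """Average number of production deployments per week over the observed window."""
    if len(deploy_times) < 2:
        return float(len(deploy_times))
    window_weeks = (max(deploy_times) - min(deploy_times)) / timedelta(weeks=1)
    return len(deploy_times) / max(window_weeks, 1e-9)

def rough_tier(per_week: float) -> str:
    """Approximate mapping of a deploy rate onto the benchmark buckets above."""
    if per_week >= 7:      # roughly daily or better: on-demand territory
        return "Elite"
    if per_week >= 1:      # between once per day and once per week
        return "High"
    return "Medium/Low"    # once per week to once per month, or less

# Invented deploy log: six deploys over a two-week window
deploys = [datetime(2024, 1, d) for d in (2, 4, 5, 9, 11, 16)]
rate = deploys_per_week(deploys)
print(round(rate, 1), rough_tier(rate))  # 3.0 High
```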

Failed Deployment Recovery Time (Formerly Mean Time to Recovery)

DORA recently updated the Mean Time to Recovery (MTTR) metric to the more specific Failed Deployment Recovery Time (FDRT), which is explicitly focused on failed software deployments rather than incidents or breaches at large. FDRT is the amount of time it takes your team to restore service when there’s a service disruption, like an outage, as a result of a failed deployment. This metric offers a look into the stability of your software, as well as the agility of your team in the face of a challenge. These are the benchmarks identified in the State of DevOps report:

Elite Performers: <1 hour (Original: Same)
High Performers: <1 day (Original: Same)
Medium Performers: 1 day to 1 week (Original: Same)
Low Performers: Between 1 month and 6 months (Original: Over 6 months)

To minimize the impact of degraded service on your value stream, there should be as little downtime as possible. If it’s taking your team more than a day to restore services, you should consider utilizing feature flags so you can quickly disable a change without causing too much disruption. If you ship in small batches, it should also be easier to discover and resolve problems.
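
To illustrate the feature-flag suggestion, here is a hedged sketch in which a risky change is wrapped in a runtime check so it can be switched off in configuration rather than rolled back with a new deployment. The dictionary stands in for whatever flag service or config system you actually use, and the checkout functions are hypothetical.

```python
# Stand-in flag store; in practice this would be your flag service or config system.
FEATURE_FLAGS = {"new_checkout_flow": True}

def is_enabled(flag: str) -> bool:
    """Return whether a flag is currently switched on."""
    return FEATURE_FLAGS.get(flag, False)

def legacy_checkout(cart: list[str]) -> str:
    return f"legacy checkout of {len(cart)} items"   # known-good fallback path

def new_checkout(cart: list[str]) -> str:
    return f"new checkout of {len(cart)} items"      # the recently shipped change

def checkout(cart: list[str]) -> str:
    # If the new path misbehaves in production, flip the flag off instead of redeploying.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"]))  # "new checkout of 2 items" while the flag is on
```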

Although Mean Time to Discover (MTTD) is different from Mean Time to Recovery, the amount of time it takes your team to detect an issue will impact your MTTR—the faster your team can spot an issue, the more quickly service can be restored.

Just like with Lead Time for Changes, you don’t want to implement hasty changes at the expense of a quality solution. Rather than deploy a quick fix, make sure that the change you’re shipping is durable and comprehensive. You should track MTTR over time to see how your team is improving, and aim for steady, stable growth in successful deployments.
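
As a sketch of how the recovery clock might be tracked (one reasonable convention, not DORA’s prescribed method), each failed deployment can be paired with the time the failure was detected and the time service was restored; the field names below are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FailedDeployment:
    deployed_at: datetime   # when the failing change went out
    detected_at: datetime   # when the problem was spotted (feeds time-to-detect)
    restored_at: datetime   # when service was back to normal

def mean_recovery_time(failures: list[FailedDeployment]) -> timedelta:
    """Average time from the failing deployment to restored service."""
    total = sum((f.restored_at - f.deployed_at for f in failures), timedelta())
    return total / len(failures)

def mean_time_to_detect(failures: list[FailedDeployment]) -> timedelta:
    """Average time from deployment to detection; shrinking this shrinks recovery time."""
    total = sum((f.detected_at - f.deployed_at for f in failures), timedelta())
    return total / len(failures)

# Invented example: detected after 20 minutes, restored after an hour
failures = [FailedDeployment(datetime(2024, 1, 8, 10, 0),
                             datetime(2024, 1, 8, 10, 20),
                             datetime(2024, 1, 8, 11, 0))]
print(mean_recovery_time(failures), mean_time_to_detect(failures))  # 1:00:00 0:20:00
```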

Change Failure Rate

Change Failure Rate (CFR) is the percentage of releases that result in downtime, degraded service, or rollbacks, which can tell you how effective your team is at implementing changes. This metric is also critical for business planning, as repeated failure-and-fix cycles will delay the launch of new product initiatives. Originally, there was not much distinction between performance benchmarks for this metric, with Elite performers pegged at 0-15% CFR and High, Medium, and Low performers all grouped into 16-30%. But the latest State of DevOps Report has made a few changes:

Elite Performers: 5% (Original: 0-15%)
High Performers: 10% (Original: 16-30%)
Medium Performers: 15% (Original: 16-30%)
Low Performers: 64% (Original: 16-30%)

Change Failure Rate is a particularly valuable metric because it can prevent your team from being misled by the total number of failures you encounter. Teams that aren’t implementing many changes will see fewer failures, but that doesn’t necessarily mean they’re more successful with the changes they do deploy. Those following CI/CD practices may see a higher number of failures, but if CFR is low, these teams will have an edge because of the speed of their deployments and their overall rate of success.

This rate can also have significant implications for your value stream: it can indicate how much time is spent remedying problems instead of developing new projects. Improve change failure rate by implementing testing, code reviews, and continuous improvement workflows.

Examples of things to monitor to maintain a low change failure rate include the following (a rough sketch of these ratios follows the list):

  • Number of rollbacks in the last 30 days
  • Ratio of incidents to deploys in the last 7 days
  • Ratio of rollbacks to deploys in the last 30 days
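
A rough sketch of those ratios, assuming you can export timestamps from your deployment log and incident tracker (the data here is invented for illustration):

```python
from datetime import datetime, timedelta

def ratio_in_window(events: list[datetime], deploys: list[datetime],
                    days: int, now: datetime) -> float:
    """Ratio of events (incidents or rollbacks) to deploys within the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent_events = sum(1 for t in events if t >= cutoff)
    recent_deploys = sum(1 for t in deploys if t >= cutoff)
    return recent_events / recent_deploys if recent_deploys else 0.0

# Illustrative usage: incidents-to-deploys over 7 days, rollbacks-to-deploys over 30 days.
now = datetime(2024, 1, 23)
deploys = [now - timedelta(days=d) for d in (1, 2, 5, 9, 20)]
rollbacks = [now - timedelta(days=2)]
incidents = [now - timedelta(days=1)]
print(ratio_in_window(incidents, deploys, days=7, now=now))    # 1/3 ≈ 0.33
print(ratio_in_window(rollbacks, deploys, days=30, now=now))   # 1/5 = 0.20
```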

Reliability

The reliability metric — or, more accurately, the reliability “dimension” — is the only factor that does not have a standard quantifiable target for performance levels. That’s because this dimension comprises several measures of operational performance, including availability, latency, performance, and scalability. Reliability can be measured with individual software SLAs, performance targets, and error budgets.

These metrics have a significant impact on customer retention and success—even if the “customers” are developers themselves. To improve reliability, organizations can set checks and targets for all of the software they create. Some examples include:

  • Attach appropriate documentation
  • Attach relevant incident runbooks
  • Ensure integration with existing incident management tools
  • Ensure software, including search tools, is kept up to date
  • Perform standard health checks
  • Add unit tests in CI
  • Ensure database failover handling code patterns are implemented
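
Since reliability is assessed through SLAs, performance targets, and error budgets rather than a single benchmark, a small hedged sketch of an availability error-budget check may help; the SLO value and request counts below are made up for illustration.

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo: float = 0.999) -> float:
    """Fraction of the error budget left for the period, given an availability SLO.

    The budget is the share of requests allowed to fail (1 - SLO); the return value
    is how much of that allowance is still unspent (negative if the SLO is blown).
    """
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# Illustrative numbers: 1,000,000 requests this month, 400 failures, 99.9% SLO
print(error_budget_remaining(1_000_000, 400))  # 0.6 -> 60% of the budget remains
```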

Are DORA metrics still the best way to build high-performing teams?

Because DORA metrics provide a high-level view of your team’s performance, they can be particularly useful for organizations trying to modernize—DORA metrics can help you identify exactly where and how to improve. Over time, you can see how your teams have grown, and which areas have been more stubborn.

Those who fall into the elite categories can leverage DORA metrics to continue improving services and to gain an edge over competitors. As the State of DevOps report reveals, the group of elite performers is rapidly growing (from 7% in 2018 to 26% in 2021), so DORA metrics can provide valuable insights for this group.

So why have these metrics come under fire recently? Criticism of the framework is rooted more in how the metrics are applied than in how they’re defined. Teams may over-rotate on the numbers themselves rather than the context surrounding them. This has led to gamification, and to a separation of output from real business outcomes. Still, it’s not difficult to see how we got here when we consider that previous means of tracking DORA metrics included static exports, cobbled-together spreadsheets, or stand-alone tools that failed to account for individual and team dynamics.

How does DORA relate to the new SPACE Framework?

Researchers including Dr. Nicole Forsgren, one of the original minds behind DORA, more recently released a new framework that refocuses measurement on the human processes behind each technical metric outlined in DORA. The SPACE Framework is a compilation of factors that together offer a more holistic view of developer productivity.

The full list includes:

Satisfaction: How do developers feel about the work they’re doing? How fulfilling is it?
Performance: Are we meeting deadlines? Addressing security issues quickly enough?
Activity: How much is being produced? PRs, commits, lines of code, etc.
Collaboration: Is the team working together and taking advantage of their strengths?
Efficiency: Are developers able to stay in their creative flow state?

While DORA primarily focused on output, SPACE focuses on the process to get to the output (optimizing workflows). This duality is why many teams don’t find the two to be mutually exclusive—and instead consider SPACE to be an extension of DORA. Dr. Nicole Forsgren herself has reportedly noted, “DORA is an implementation of SPACE.”

This framing is bolstered by SPACE’s open guidelines—which are intentionally non-prescriptive when it comes to the data needed to assess each pillar. This makes the SPACE framework highly portable and universally applicable, regardless of your organization’s maturity.

Are we done with DORA?

While DORA has been met with increased criticism in recent years, the addition of SPACE has greatly balanced the equation. Rather than feel the need to choose between the two models, or throw away DORA entirely, engineering and DevOps teams should consider using them in parallel to give equal weight to developer productivity and happiness. Using both together enables organizations to ask bi-directional questions like: Is that recent performance issue impacting satisfaction and efficiency? Or has a recent increase in satisfaction led to a temporary dip in performance?

Context is king. DORA metrics can still be used to improve overall team performance, but they must be considered lagging indicators of success in relation to context about team talent, tenure, and the complexity of ongoing projects. Engineering leaders should first consider the health of software produced in the context of this information, and then use DORA metrics to trace correlations between metrics associated with velocity and reliability. For example, software produced by a new team that fails basic checks of maturity and security is far more likely to see comparatively poor MTTR and change failure rate metrics. But that doesn’t mean the team lacks drive or capability.
