#ToggleTalk 3: Resiliency

Dawn Parzych - Apr 17 '20 - Dev Community

According to Merriam-Webster, resilience is defined as “an ability to recover from or adjust easily to misfortune or change.”  

Toggle thought this topic was a good follow-up to last week’s conversation about productivity. We are currently having to adjust our expectations about what being productive means. We’ve had to adapt quickly to new remote working situations. Systems are being pushed to their limits as large numbers of people move quickly to cloud-based solutions for meetings, social gatherings, and educating students. How well you adjust to these changes depends on your resiliency.

Questions we posed on resiliency:

  • How do you define resilience?
  • How do you build resiliency in your systems? 
  • How do you increase your own tolerance for disruption and failure?
  • What value can we derive from critical events?

Highlight reel

Whole vs. parts

If you are looking for resilience, you have to look at the big picture. From a technology perspective, if you are striving for five-nines availability, you have to look not just at the technology but also at the people, the processes, and the organization as a whole.
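To put a number on five-nines, here is a minimal sketch of the arithmetic behind an availability target. The function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration; they are not from the chat.

```python
def downtime_budget_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the minutes of downtime allowed over `days` at the given availability.

    Assumes a 365-day year by default (hypothetical helper, for illustration only).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare the yearly downtime budget for a few common targets.
    for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_budget_minutes(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes per year")
```

At five nines the yearly budget comes to a little over five minutes. With a margin that thin, how quickly people notice, decide, and respond matters as much as the technology itself.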


Sociotechnical models do just this and help when it comes to resiliency. Sociotechnical theory looks at the interrelationships between the social and technical aspects of a system. Consider how people will use the software, who will be using it, and who will be supporting it. This can help you build resilience as you adapt to the changing social aspects.

When looking at the social aspects, remember that resources are finite. And people are not resources. People do not have an infinite ability to respond to and recover from failures. We use metrics to track the health of individual elements of our systems. We can also use metrics to track our own health and ability to respond to failures.


Humans are resilient, systems are robust

One aspect of resilience is sustained adaptability. This is where humans come in. People make decisions about what to build, how to build it, and how and when to change it. Systems will not adapt without humans. It isn’t possible to separate the human from the tech.


Surprise!

I love the framing of incidents as surprises. It takes away some of the stigma of incidents being bad. If we frame incidents as surprise learning opportunities, it helps us figure out what the best response is.
 

Mental models

The conversation about resilience and surprises seemed to lead naturally to a discussion of mental models. A mental model is someone’s explanation of how something works. Mental models help us understand and interpret the relationships between things. When we encounter an obstacle, we may have to update our mental models. The solution that worked previously may not work the second time around. Our ability to continually update our mental models is part of our resiliency.


Summary

During #ToggleTalk, we touched on all four concepts for resilience as outlined by David Woods (see article below): 

  • Ability to rebound 
  • Robustness 
  • Extensibility 
  • Adaptability 

We need to look at technology from a sociotechnical perspective for true resiliency.

Thanks to everybody who joined in this week’s discussion on resilience. See you next week on #ToggleTalk!

Want more?

There is an upcoming conference (next week on April 21st!) if you want to learn more about resilience engineering and the process for building systems that can withstand unexpected failures. You can register here: FailoverConf (free of charge!). We will be there, and so will one of our Developer Advocates, Heidi Waterhouse.

Or you can check out these recommended reads and talks: 

Recommended Reads

  • Resilience is a Verb
  • Four concepts for resilience and the implications for the future of resilience engineering
  • Report from the SNAFUcatchers Workshop on Coping with Complexity
  • Above the line, below the line

Recommended Talks

  • OOPS! Learning from Surprise at Netflix
  • How did things go right? Learning from incidents
  • A Few Observations on the Marvelous Resilience of Bone & Resilience Engineering
