Resilient Response in Complex Systems, John Allspaw, SVP, Tech Ops

SLIDE 1

Resilient Response in Complex Systems

John Allspaw SVP, Tech Ops

Friday, March 9, 12
SLIDE 2

OPERABILITY

SLIDE 3

PRODUCTION

SLIDE 4

http://whoownsmyavailability.com

SLIDE 6

How important is this?

SLIDE 19

How important is this?

SLIDE 20

How Can This Happen?

SLIDE 21

Complicated? Complex?

SLIDE 22

Complex Systems

  • Cascading Failures
  • Difficult to determine boundaries
  • Complex systems may be open
  • Complex systems may have a memory
  • Complex systems may be nested
  • Dynamic network of multiplicity
  • May produce emergent phenomena
  • Relationships are non-linear
  • Relationships contain feedback loops
SLIDE 23

1998

SLIDE 24

How Can This Happen?

It does happen. And it will again. And again.

SLIDE 26

Optimization

MTBF MTTR
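
The slide names the two quantities usually traded off when optimizing for availability. As a rough sketch that is not taken from the talk, the standard steady-state relation availability = MTBF / (MTBF + MTTR) shows that recovering faster can help as much as failing less often. The function name and the sample hours below are illustrative assumptions.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair (recover)
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only to illustrate the trade-off.
baseline = availability(mtbf_hours=500, mttr_hours=2)
longer_mtbf = availability(mtbf_hours=1000, mttr_hours=2)  # fail half as often
shorter_mttr = availability(mtbf_hours=500, mttr_hours=1)  # recover twice as fast

print(f"baseline:     {baseline:.5f}")
print(f"double MTBF:  {longer_mtbf:.5f}")
print(f"halve MTTR:   {shorter_mttr:.5f}")
```

With these made-up inputs, doubling MTBF and halving MTTR give the same availability, since 1000/1002 and 500/501 are the same ratio.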

SLIDE 27

http://www.flickr.com/photos/sparktography/75499095/

SLIDE 28

How does team troubleshooting happen?

SLIDE 29

Time: Problem Starts → Detection → Evaluation → Response → Stable → Confirmation → All Clear

PostMortem
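
One way to make the timeline above usable in a postmortem is to record a timestamp for each milestone and look at how long each stretch took. The sketch below assumes every label on the timeline is a point in time; the milestone keys, the function name, and the sample incident are hypothetical and not from the talk.

```python
from datetime import datetime

# Milestones in the order they appear on the timeline.
MILESTONES = [
    "problem_starts",
    "detection",
    "evaluation",
    "response",
    "stable",
    "confirmation",
    "all_clear",
]


def phase_durations(timestamps):
    """Return minutes elapsed between each consecutive pair of milestones."""
    durations = {}
    for earlier, later in zip(MILESTONES, MILESTONES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        if delta.total_seconds() < 0:
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier} -> {later}"] = delta.total_seconds() / 60
    return durations


# Hypothetical incident, for illustration only.
incident = {
    "problem_starts": datetime(2012, 3, 9, 14, 0),
    "detection": datetime(2012, 3, 9, 14, 7),
    "evaluation": datetime(2012, 3, 9, 14, 12),
    "response": datetime(2012, 3, 9, 14, 20),
    "stable": datetime(2012, 3, 9, 14, 45),
    "confirmation": datetime(2012, 3, 9, 14, 55),
    "all_clear": datetime(2012, 3, 9, 15, 5),
}

durations = phase_durations(incident)
time_to_detect = durations["problem_starts -> detection"]
total_minutes = sum(durations.values())

for phase, minutes in durations.items():
    print(f"{phase}: {minutes:.0f} min")
```

Collected across incidents, these per-phase numbers show whether the time goes to noticing the problem, understanding it, or fixing it.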

SLIDE 30

Time: Problem Starts → Detection → Evaluation → Response → Stable → Confirmation → All Clear

Stress

PostMortem

SLIDE 31

  • Forced beyond learned roles
  • Actions whose consequences are both important and difficult to see
  • Cognitively and perceptually noisy
  • Coordinative load increases exponentially

SLIDE 33

So What Can We Do?

SLIDE 34

We Learn From Others

SLIDE 35

Characteristics of response to escalating scenarios

SLIDE 36

Characteristics of response to escalating scenarios

...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment

“On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980

SLIDE 37

Characteristics of response to escalating scenarios

...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)

“On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980

SLIDE 38

Characteristics of response to escalating scenarios

...inclined to think in causal series, instead of causal nets. A therefore B, instead of A, therefore B and C (therefore D and E), etc.

“On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980

SLIDE 39

Pitfalls

Thematic Vagabonding

SLIDE 40

Pitfalls

Goal Fixation (encystment)

SLIDE 41

Pitfalls

Refusal to make decisions

SLIDE 42

Non-communicating lone wolf-isms

Heroism

SLIDE 43

Irrelevant noise in comm channels

Distraction

SLIDE 44

Jens Rasmussen, 1983

Senior Member, IEEE “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”

IEEE Transactions On Systems, Man, and Cybernetics, May 1983

SLIDE 45

SKILL-BASED: Simple, routine

RULE-BASED: Knowable, but unfamiliar

KNOWLEDGE-BASED: WTF IS GOING ON?

(Reason, 1990)

SLIDE 46

Team Dynamics

SLIDE 47

High Reliability Organizations

  • Air Traffic Control
  • Naval Air Operations At Sea
  • Electrical Power Systems
  • Etc.

  • Complex Socio-Technical systems
  • Efficiency <-> Thoroughness
  • Time/Resource Constrained
  • Engineering-driven
SLIDE 49

“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”

Rochlin, La Porte, and Roberts. Naval War College Review 1987

SLIDE 51

Close interdependence between groups

SLIDE 52

Close reciprocal coordination and information sharing, resulting in overlapping knowledge

SLIDE 53

High redundancy: multiple people observing the same event and sharing information

SLIDE 54

Broad definition of who belongs to the team.

SLIDE 55

Teammates are included in the communication loops rather than excluded.

SLIDE 56

Lots of error correction.

SLIDE 57

High levels of situation comprehension: maintain constant awareness of the possibility of accidents.

SLIDE 58

High levels of interpersonal skills

SLIDE 59

Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.

SLIDE 60

Patterns of authority are changed to meet the demands of the events: organizational flexibility.
SLIDE 61

The reporting of errors and faults is rewarded, not punished.

SLIDE 62

So What Else Can We Do?

SLIDE 63

We Drill

SLIDE 64

We GameDay

SLIDE 66

We Learn To Improvise

SLIDE 67

IMPROVISATION

SLIDE 68

IMPROVISATION

SLIDE 69

We Learn From Our Mistakes

SLIDE 70

Postmortems

  • Full timelines: What happened, when
  • Review in public, everyone invited
  • Search for “second stories” instead of “human error”
  • Cultivating a blameless environment
  • Giving requisite authority to individuals to improve things
SLIDE 71

Qualifying Response

  • High signal:noise in comm channels?
  • Troubleshooting fatigue?
  • Troubleshooting handoff?
  • All tools on-hand?
  • Improvised tooling or solutions?
  • Metrics visibility?
  • Collaborative and skillful communication?

SLIDE 72

Remediation

SLIDE 73

Mature Role of Automation

http://www.bainbrdg.demon.co.uk/Papers/Ironies.html “Ironies of Automation” - Lisanne Bainbridge

SLIDE 74

Mature Role of Automation

  • Moves humans from manual operator to supervisor
  • Extends and augments human abilities, doesn’t replace them
  • Doesn’t remove “human error”
  • Is brittle
  • Recognizes that there is always discretionary space for humans
  • Recognizes the Law of Stretched Systems
SLIDE 75

Law of Stretched Systems

“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”

D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006

SLIDE 76

We Share Near-Miss Events

SLIDE 77

Near Misses

Hey everybody -

Don’t be like me. I tried to X, but that wasn’t a good idea. It almost exploded everyone.

So, don’t do: (details about X)

Love, Joe

SLIDE 78

Near Misses

  • Can act like “vaccines” - help system safety without actually hurting anything
  • Happen more often, so provide more data on latent failures
  • Powerful reminder of hazards, and slows down the process of forgetting to be afraid

SLIDE 79

A parting word A parting challenge

SLIDE 80

Two Propositions

SLIDE 81

100 changes
6 change-related issues

SLIDE 82

100 > 6

SLIDE 83

Proposition #1

“Ways in which things go right are special cases of the ways in which things go wrong.”
SLIDE 84

Proposition #1

Successes = failures gone wrong
Study the failures, generalize from that.
Potential data sources: 6 out of 100

SLIDE 85

Proposition #2

“Ways in which things go wrong are special cases of the ways in which things go right.”

SLIDE 86

Proposition #2

Failures = successes gone wrong
Study the successes, generalize from that.
Potential data sources: 94 out of 100
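
The 94 here follows from the counts on the earlier slide: 100 changes, of which 6 had change-related issues. A minimal worked version of that arithmetic, with variable names that are illustrative only:

```python
# Counts taken from the slides: 100 changes, 6 with change-related issues.
changes_total = 100
changes_with_issues = 6

# Everything that did not go wrong is a potential source of data on success.
changes_without_issues = changes_total - changes_with_issues

failure_share = changes_with_issues / changes_total
success_share = changes_without_issues / changes_total

print(f"Proposition #1 studies {changes_with_issues} of {changes_total} changes")
print(f"Proposition #2 studies {changes_without_issues} of {changes_total} changes")
```

Under Proposition #2 there are far more cases to learn from, simply because most changes go right.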

SLIDE 87

94/100? OR 6/100?

SLIDE 88

What and WHY Do Things Go RIGHT?

SLIDE 89

Not just:

why did we fail?

But also:

why did we succeed?

SLIDE 90

Resilient Response

  • Can learn from other fields
  • Can train for outages
  • Can learn from mistakes
  • Can learn from successes as well as failures
SLIDE 91

http://www.flickr.com/photos/sparktography/75499095/

SLIDE 92

THE END
