Illusions of Certainty: What the brain can teach us about software engineering



slide-1
SLIDE 1

Illusions of Certainty

What the brain can teach us about software engineering

Julie Pitt

Co-founder, Order of Magnitude Labs

@yakticus

slide-2
SLIDE 2
slide-3
SLIDE 3

relevant links found here: github.com/yakticus/IllusionsOfCertainty

slide-4
SLIDE 4

“For the things we have to learn before we can do them, we learn by doing them.” ― Aristotle, The Nicomachean Ethics

slide-5
SLIDE 5

today we will discuss

➔ a BIG reason why software projects are unpredictable
➔ how to help computers better understand what we mean
➔ how to make our software systems more resilient
➔ how to better understand what our software systems are doing

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9

life

slide-10
SLIDE 10

life: a generative model with an interface to the world

[diagram: the world → senses → generative model → action → the world]

slide-11
SLIDE 11

survival

slide-12
SLIDE 12

working working not working

slide-13
SLIDE 13

software as a generative model

[diagram: the world → input → the code → output → the world]

slide-14
SLIDE 14

software as a generative model

[diagram: the world → input → the code (infinite precision) → output → the world]

slide-15
SLIDE 15
slide-16
SLIDE 16
slide-17
SLIDE 17

misjudging uncertainty in software

reality perception

slide-18
SLIDE 18
slide-19
SLIDE 19

human precision

  • don’t kill humans
  • be nice to people
  • don’t hurt people
  • keep humans alive
  • respect human life

don’t kill humans

slide-20
SLIDE 20

machine precision

  • don’t kill humans
  • be nice to people
  • don’t hurt people
  • keep humans alive
  • respect human life

slide-21
SLIDE 21

machine precision

  • don’t kill humans
  • be nice to people
  • don’t hurt people
  • keep humans alive
  • respect human life

slide-22
SLIDE 22

the cliffs of infinite precision

the happy path utterly broken

slide-23
SLIDE 23

how do we get to this?

optimal

degraded resilience

slide-24
SLIDE 24

ways we can cheat

➔ property tests
➔ remedy-first design
➔ build intuitive insight

slide-25
SLIDE 25

property tests

slide-26
SLIDE 26
slide-27
SLIDE 27

test suite as a generative model

system under test

y x test suite

slide-28
SLIDE 28

individual test cases are often too precise

software system state space

desired behavior tests (“training examples”)

slide-29
SLIDE 29

testing an addition function: F# example

state space

credit: http://fsharpforfunandprofit.com/posts/property-based-testing/

✅ test passes

slide-30
SLIDE 30
overfitting to tests

state space

bug

credit: http://fsharpforfunandprofit.com/posts/property-based-testing/

slide-31
SLIDE 31

property tests combat overfitting

state space

bug

credit: http://fsharpforfunandprofit.com/posts/property-based-testing/
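
The overfitting problem on the previous slides can be sketched in a few lines. This is a minimal Python illustration, not the F# code from the credited post: `add_overfit`, `passes_examples`, and `passes_properties` are hypothetical names, and the three properties (commutativity, adding 1 twice equals adding 2, zero is the identity) are one reasonable choice for an addition function. Random inputs come from the standard library rather than a property-testing framework.

```python
import random


def add_overfit(x, y):
    """Deliberately overfitted: special-cases the example inputs only."""
    if (x, y) == (1, 2):
        return 3
    if (x, y) == (2, 2):
        return 4
    return 0


def add(x, y):
    """The behavior we actually want."""
    return x + y


def passes_examples(f):
    # Precise, hand-picked test cases (the "training examples").
    return f(1, 2) == 3 and f(2, 2) == 4


def passes_properties(f, trials=200, seed=0):
    # Properties are checked on random points across the state space.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        y = rng.randint(-1000, 1000)
        if f(x, y) != f(y, x):        # commutativity
            return False
        if f(f(x, 1), 1) != f(x, 2):  # adding 1 twice equals adding 2
            return False
        if f(x, 0) != x:              # zero is the identity
            return False
    return True
```

Both implementations pass the example tests, but only `add` passes the property checks; the overfitted version is caught because the properties do not name any particular input.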

slide-32
SLIDE 32

property tests: let’s review

  • test suites are generative models
  • describe the properties of your system
  • requires less precision
  • test the properties
slide-33
SLIDE 33

remedy-first design

slide-34
SLIDE 34
slide-35
SLIDE 35

GET /api/metadata/12345

{
  "status": "failure",
  "error": {
    "errorCode": 234,
    "description": "database timeout"
  }
}

input

output

RESTful service

client falls off cliff
slide-36
SLIDE 36

each error has a precise cause

  • read timeout
  • connect timeout
  • connection pool exhausted
  • token expired
  • credentials revoked
  • key rotation
  • endpoint moved
  • endpoints expired
  • failover
  • user error
  • insufficient permissions
  • account problem

slide-37
SLIDE 37

remedies are imprecise

  • read timeout
  • connect timeout
  • connection pool exhausted
  • token expired
  • credentials revoked
  • key rotation
  • endpoint moved
  • endpoints expired
  • failover
  • user error
  • insufficient permissions
  • account problem

RETRY, REDIRECT, DISPLAY ERROR, RE-AUTHENTICATE
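
The many-causes-to-few-remedies idea can be sketched as a lookup table. The four remedy names come from the slide; which cause maps to which remedy is not stated in the deck, so the assignments below are assumptions for illustration, as are the names `Remedy`, `REMEDY_FOR_CAUSE`, and `remedy_for`.

```python
from enum import Enum


class Remedy(Enum):
    RETRY = "RETRY"
    REDIRECT = "REDIRECT"
    DISPLAY_ERROR = "DISPLAY_ERROR"
    RE_AUTHENTICATE = "RE_AUTHENTICATE"


# Assumed grouping: the slide lists causes and remedies, not this mapping.
REMEDY_FOR_CAUSE = {
    "read timeout": Remedy.RETRY,
    "connect timeout": Remedy.RETRY,
    "connection pool exhausted": Remedy.RETRY,
    "token expired": Remedy.RE_AUTHENTICATE,
    "credentials revoked": Remedy.RE_AUTHENTICATE,
    "key rotation": Remedy.RE_AUTHENTICATE,
    "endpoint moved": Remedy.REDIRECT,
    "endpoints expired": Remedy.REDIRECT,
    "failover": Remedy.REDIRECT,
    "user error": Remedy.DISPLAY_ERROR,
    "insufficient permissions": Remedy.DISPLAY_ERROR,
    "account problem": Remedy.DISPLAY_ERROR,
}


def remedy_for(cause):
    # Unknown causes fall back to showing the error (an assumed default).
    return REMEDY_FOR_CAUSE.get(cause, Remedy.DISPLAY_ERROR)
```

Twelve precise causes collapse into four remedies, so a client only needs to understand the four.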

slide-38
SLIDE 38

remedy tells the client how to ease pain

{
  "status": "failure",
  "failure": {
    "action": "RETRY",
    "error": {
      "errorCode": 234,
      "description": "database timeout"
    }
  }
}

remedy (actionable): the "action" field
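
A client can act on the remedy without understanding the cause. The sketch below assumes the response shape shown on this slide; `call_with_remedies`, `make_flaky_service`, the `"GIVE_UP"` marker, and the success payload are hypothetical names invented for the example, and a real client would also back off between retries.

```python
def call_with_remedies(fetch, max_attempts=3):
    """Call fetch() and follow the remedy carried in each failure response.

    fetch is a hypothetical zero-argument callable returning a dict
    shaped like the response on the slide.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response.get("status") != "failure":
            return response
        action = response["failure"].get("action")
        if action == "RETRY":
            continue  # a real client would wait before trying again
        # Any other remedy is passed to the caller rather than guessed at.
        return {"status": "failure", "action": action, "attempts": attempt}
    return {"status": "failure", "action": "GIVE_UP", "attempts": max_attempts}


def make_flaky_service(failures):
    """Fake service that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def fetch():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return {
                "status": "failure",
                "failure": {
                    "action": "RETRY",
                    "error": {"errorCode": 234, "description": "database timeout"},
                },
            }
        return {"status": "success", "data": {"id": 12345}}

    return fetch
```

The client never inspects `errorCode`; the error detail stays available for logs and humans, while the `action` field drives behavior.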

slide-39
SLIDE 39

What about failures that weren’t predicted?

slide-40
SLIDE 40

AWS outage - 2012/10/22

  • DNS change didn’t propagate
  • indirectly triggered a latent memory leak
  • insufficient alerting; failovers happened too little, too late
  • API throttling affected some customers more than others
  • many popular internet services down for hours

failure comes in many forms

slide-41
SLIDE 41

AWS scheduled maintenance - 2014/09/25

  • time-sensitive security update on 10% of EC2 nodes
  • required reboot of those nodes
  • possible impact to customer applications running on those nodes

failure comes in many forms

slide-42
SLIDE 42

AWS DynamoDB outage - 2015/09/20

  • DynamoDB failed in us-east-1 region
  • dozens of dependent services also failed
  • many prominent internet services were taken down for hours

failure comes in many forms

slide-43
SLIDE 43

Netflix was prepared

slide-44
SLIDE 44

meet simian army

  • OSS project by Netflix
  • deliberately causes failures in a controlled manner
  • e.g., randomly takes down AWS EC2 nodes, a datacenter, or a region
  • validates whether the system handles failure
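
The idea of injecting failure without caring about its cause can be sketched with a toy model. This is not the Simian Army API: `Cluster`, `kill_random_node`, and `survives_single_failure` are hypothetical names, and the "cluster" is just an in-memory dictionary standing in for real nodes.

```python
import random


class Cluster:
    """Toy stand-in for a replicated service."""

    def __init__(self, node_names):
        self.alive = {name: True for name in node_names}

    def handle_request(self):
        # A request succeeds as long as any node is still up.
        for name, up in self.alive.items():
            if up:
                return f"served by {name}"
        raise RuntimeError("no healthy nodes")


def kill_random_node(cluster, rng):
    # Simulate the nature of the failure (a node disappears), not its cause.
    candidates = [name for name, up in cluster.alive.items() if up]
    victim = rng.choice(candidates)
    cluster.alive[victim] = False
    return victim


def survives_single_failure(cluster, rng):
    # Inject one failure, then validate that the system still answers.
    kill_random_node(cluster, rng)
    try:
        cluster.handle_request()
        return True
    except RuntimeError:
        return False
```

A three-node cluster survives the injected failure; a single-node cluster does not, which is the kind of finding worth making while wide awake.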

slide-45
SLIDE 45

simian army -> cultural change

  • failure is the norm
  • simulates the nature of failure and not the cause
  • we can’t predict all causes of failure

slide-46
SLIDE 46

remedy-first design: let’s review

  • design with remedies in mind
  • # remedies << # causes
  • test resilience during business hours
  • find out what you’re up against when wide awake
  • use a tool that is agnostic to causes
  • e.g., simian army
slide-47
SLIDE 47

intuitive feedback

slide-48
SLIDE 48

is it working?

slide-49
SLIDE 49

logs: easy to produce

slide-50
SLIDE 50

logs: hard to consume

slide-51
SLIDE 51

charts

slide-52
SLIDE 52

charts: easier to consume, but still hard

slide-53
SLIDE 53

we want the whole picture

slide-54
SLIDE 54

solution: leverage our intuition

slide-55
SLIDE 55
slide-56
SLIDE 56

thought experiment

What if your software system’s interactions sounded like cars on the road?

slide-57
SLIDE 57

intuitive feedback: let’s review

  • humans want to know “is it working?”
  • the tools of today inhibit us from seeing the big picture
  • we need tools that leverage our intuition
  • e.g., vizceral & TBD
slide-58
SLIDE 58

conclusion

slide-59
SLIDE 59
slide-60
SLIDE 60

curiosity-driven tests

system under test

senses action

test agent

(neural network)

slide-61
SLIDE 61

mapping the state space through exploration

state space

begin testing random states without expectations

slide-62
SLIDE 62

mapping the state space through exploration

state space

gradually build a model containing expectations

slide-63
SLIDE 63

mapping the state space through exploration

state space

model capable of recognizing anomalies
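
The three exploration stages can be sketched as follows. The slides propose a neural network as the test agent; this sketch swaps in a much simpler statistical model (median and median absolute deviation) to keep the example short. Everything here is hypothetical: `system_under_test` is an invented system with a hidden bad region, and `ExpectationModel`, `explore`, `warmup`, and `tolerance` are names and parameters chosen for illustration.

```python
import random
import statistics


def system_under_test(x):
    """Invented system: answers in 10-14 time units, except in a hidden bad region."""
    if 900 <= x <= 910:
        return 500
    return 10 + x % 5


class ExpectationModel:
    """Stand-in for the neural network: expectations are median and spread."""

    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)

    def is_anomaly(self, value, tolerance=10):
        center = statistics.median(self.observations)
        spread = statistics.median([abs(v - center) for v in self.observations])
        # Guard against a zero spread so the threshold never collapses.
        return abs(value - center) > tolerance * max(spread, 1.0)


def explore(model, trials=500, warmup=100, seed=0):
    rng = random.Random(seed)
    surprising_inputs = []
    for step in range(trials):
        x = rng.randrange(1000)            # test a random state
        observed = system_under_test(x)
        # During warmup the agent has no expectations and only observes.
        if step >= warmup and model.is_anomaly(observed):
            surprising_inputs.append(x)
        model.observe(observed)            # gradually build expectations
    return surprising_inputs
```

No test case ever states what the correct response time is; the agent learns what is typical and reports the inputs that surprise it.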

slide-64
SLIDE 64

self-healing systems

telemetry

senses action

ops agent

(neural network)

actions: deployment, scaling, failover, etc.
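
The sense-and-act loop in this diagram can be sketched with fixed rules in place of the neural network the slide proposes. The telemetry fields (`healthy`, `cpu_utilization`, `instances`), the thresholds, and the action names are all assumptions made for this example.

```python
def choose_action(telemetry):
    """Map one telemetry reading to an ops action (rule-based stand-in)."""
    if not telemetry["healthy"]:
        return "failover"
    if telemetry["cpu_utilization"] > 0.8:
        return "scale_out"
    if telemetry["cpu_utilization"] < 0.2 and telemetry["instances"] > 1:
        return "scale_in"
    return "none"


def control_loop(readings):
    # Sense each reading in turn and decide what to do about it.
    return [choose_action(reading) for reading in readings]
```

A learned agent would replace `choose_action`; the surrounding loop of sensing telemetry and emitting actions stays the same.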

slide-65
SLIDE 65

working working not working

let’s review

slide-66
SLIDE 66

goal: change the landscape

slide-67
SLIDE 67

the end.

slide-68
SLIDE 68
slide-69
SLIDE 69

links: github.com/yakticus/IllusionsOfCertainty