Safe Reinforcement Learning via Formal Methods Nathan Fulton and André Platzer Carnegie Mellon University
Safety-Critical Systems "How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
Autonomous Safety-Critical Systems How can we provide people with autonomous cyber-physical systems they can bet their lives on?
Model-Based Verification vs. Reinforcement Learning
Approach: prove that control software (ctrl) achieves a specification φ, e.g. pos < stopSign, with respect to a model of the physical system.
Model-Based Verification
Benefits:
● Strong safety guarantees
● Automated analysis: computational aids (ATP)
Drawbacks:
● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful”
● Assumes an accurate model

Reinforcement Learning (act, observe, compute reward)
Benefits:
● No need for a complete model
● Optimal (effective) policies
Drawbacks:
● No strong safety guarantees
● Proofs are obtained and checked by hand
● Formal proofs = decades-long proof development

Goal: Provably correct reinforcement learning
1. Learn safely
2. Learn a safe policy
3. Justify claims of safety
Model-Based Verification
Accurate, analyzable models often exist!

{
  {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (discrete, non-deterministic control)
  {pos’ = vel, vel’ = acc}                         (continuous motion)
}*

Formal verification gives strong safety guarantees:

init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign

● Computer-checked proofs of the safety specification
● Formal proofs mapping the model to runtime monitors
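To make the “runtime monitors” point concrete, here is a minimal Python sketch of a controller monitor mirroring the model’s ?safeAccel guard. The constants A (max acceleration), B (max braking), and T (control period) are illustrative assumptions; in the actual toolchain such monitors are synthesized from the proof (e.g., by KeYmaera X’s ModelPlex), so this hand-written condition is only a stand-in.

    A = 2.0   # assumed maximum acceleration (m/s^2)
    B = 4.0   # assumed maximum braking (m/s^2)
    T = 0.1   # assumed control-loop period (s)

    def safe_accel(pos, vel, stop_sign):
        """Mirror of the ?safeAccel test: even after accelerating at A for one
        control period, the car can still brake to a stop before stop_sign."""
        accel_dist = vel * T + 0.5 * A * T**2   # distance covered while accelerating
        vel_after = vel + A * T                 # velocity at the end of the period
        stopping_dist = vel_after**2 / (2 * B)  # braking distance under max braking
        return pos + accel_dist + stopping_dist < stop_sign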
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist! How to implement the non-deterministic controller?

{
  {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};
  {dx’ = w*y, dy’ = -w*x, ...}                     (only accurate sometimes)
}*
Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
1. learns to resolve non-determinism without sacrificing formal safety results
2. allows and directs speculation whenever model mismatches occur
Learning to Safely Resolve Non-determinism
Act: resolve the model’s non-deterministic choice accel ∪ brake ∪ turn, i.e., pick from {accel, brake, turn}
Observe & compute reward ⇨ Policy
A Safety Monitor checks each candidate action (safe?) before the policy may take it, and “safe” here ≠ “trust me”: the monitor’s verdict is backed by proof.
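Concretely, monitor-gated action selection can look like the following sketch, where the Q-table, the ε-greedy policy, and the brake fallback are assumptions for illustration rather than the paper’s exact implementation:

    import random

    ACTIONS = ["accel", "brake", "turn"]

    def choose_action(state, q_table, ctrl_monitor, epsilon=0.1):
        """Resolve non-determinism with a learned policy, restricted to
        monitor-approved actions."""
        safe_actions = [a for a in ACTIONS if ctrl_monitor(state, a)]
        if not safe_actions:
            return "brake"  # assumed safe fallback when nothing else passes
        if random.random() < epsilon:
            return random.choice(safe_actions)  # exploration stays inside the safe set
        return max(safe_actions, key=lambda a: q_table.get((state, a), 0.0))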
Use a theorem prover to prove: (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ
Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
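The model monitor that the theorem appeals to checks, at runtime, that each observed transition is consistent with the plant model pos’ = vel, vel’ = acc. A minimal sketch, assuming noisy observations and an illustrative tolerance (the real monitor condition is derived by proof, not hand-written):

    def model_monitor(prev, curr, acc, dt, tol=0.05):
        """prev/curr are (pos, vel) observations dt seconds apart under control acc."""
        pos0, vel0 = prev
        pos1, vel1 = curr
        # Closed-form solution of pos' = vel, vel' = acc over one control period.
        pred_pos = pos0 + vel0 * dt + 0.5 * acc * dt**2
        pred_vel = vel0 + acc * dt
        return abs(pos1 - pred_pos) <= tol and abs(vel1 - pred_vel) <= tol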
What About the Physical Model?
The proof assumed the plant {pos’ = vel, vel’ = acc}, but reality may differ (≠).
While the model is accurate, actions chosen from {brake, accel, turn} under the monitor remain provably safe.
When the model is inaccurate (e.g., an unexpected obstacle appears), what the model expects and what reality does diverge.
Speculation is Justified
Expected: (safe). Reality: (crash!). Once observations contradict the model, the safety proof no longer applies, so the agent is justified in speculating beyond the verified model.
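Putting the two monitors together gives the overall Justified Speculative Control loop. The sketch below is a simplified reading of the approach: it gates actions through the controller monitor while the model monitor holds, and otherwise lets the learned policy speculate (agent.best_action and the unconstrained fallback are illustrative assumptions; the paper also directs how speculation proceeds):

    ACTIONS = ["accel", "brake", "turn"]

    def jsc_step(state, prev_obs, curr_obs, last_acc, agent,
                 ctrl_monitor, model_monitor, dt):
        """One decision step of (simplified) Justified Speculative Control."""
        if model_monitor(prev_obs, curr_obs, last_acc, dt):
            # Model confirmed: only provably safe actions are allowed.
            candidates = [a for a in ACTIONS if ctrl_monitor(state, a)]
        else:
            # Model mismatch observed: the proof no longer binds, so
            # speculation beyond the verified action set is justified.
            candidates = ACTIONS
        return agent.best_action(state, candidates)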