

SLIDE 1

CS 730/830: Intro AI

Solving MDPs; MDP Extras

Wheeler Ruml (UNH), Lecture 20, CS 730

SLIDE 2

Solving MDPs

■ Definition
■ What to do?
■ Value Iteration
■ Stopping
■ Sweeping
■ SSPs
■ RTDP
■ UCT
■ Break
■ Policy Iteration
■ Policy Evaluation
■ Summary

MDP Extras


SLIDE 3

Markov Decision Process (MDP)



initial state: s0
transition model: T(s, a, s′) = probability of going from s to s′ after doing a
reward function: R(s) for landing in state s
terminal states: sinks = absorbing states (end the trial)

objective:
■ total reward over a (finite) trajectory: R(s0) + R(s1) + R(s2) + ...
■ discounted reward: penalize the future by γ: R(s0) + γR(s1) + γ²R(s2) + ...

find: a policy π(s) = a
optimal policy: π∗
proper policy: one that reaches a terminal state
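For concreteness, the code sketches later in this transcript assume a tiny tabular encoding along these lines (a hypothetical two-state, two-action MDP; the numbers are invented for illustration, not from the lecture):

```python
import numpy as np

# R[s]        = reward for landing in state s
# T[a, s, s2] = probability of going from s to s2 after doing a
R = np.array([0.0, 1.0])
T = np.array([
    [[0.9, 0.1],   # action 0, from states 0 and 1
     [0.0, 1.0]],
    [[0.2, 0.8],   # action 1, from states 0 and 1
     [0.5, 0.5]],
])
```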

SLIDE 4

What to do?



π∗(s) = argmax_a Σ_{s′} T(s, a, s′) U^{π∗}(s′)

U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s0 = s ]

The key:

U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

(Richard Bellman, 1957)
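For intuition: for a single state s whose only action loops back to s with probability 1, the key equation gives U(s) = R(s) + γU(s), so U(s) = R(s)/(1 − γ); a reward of 1 per step at γ = 0.9 is worth 10.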

SLIDE 5

Value Iteration



Repeated Bellman updates:

repeat until happy:
    for each state s:
        U′(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)
    U ← U′

With infinite updates everywhere, guaranteed to reach equilibrium. The equilibrium is the unique solution to the Bellman equations!

Asynchronous updating also works: it converges if every state is updated infinitely often (no state permanently ignored).
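A minimal sketch in Python, assuming the tabular R and T arrays introduced after the MDP definition; the stopping test is the one from the next slides:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, epsilon=1e-6):
    U = np.zeros(R.shape[0])
    while True:
        # U'(s) = R(s) + gamma * max_a sum_s' T(s, a, s') U(s')
        U_new = R + gamma * np.max(T @ U, axis=0)
        # stop once ||U_i - U_{i-1}|| * gamma / (1 - gamma) < epsilon
        if np.max(np.abs(U_new - U)) * gamma / (1 - gamma) < epsilon:
            return U_new
        U = U_new
```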
SLIDE 6-8

Stopping

||U_i − U_{i−1}|| = max difference between corresponding elements

U∗ = U^{π∗}

if ||U_i − U_{i−1}|| · γ/(1 − γ) < ε, then ||U_i − U∗|| < ε

if ||U_i − U∗|| < ε, then ||U^{π_i} − U^{π∗}|| < 2εγ/(1 − γ)

so the policy loss is < 2(maxUpdate)γ/(1 − γ); equivalently, maxUpdate > loss(1 − γ)/(2γ)
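The same bound rearranged as a trivial helper:

```python
def residual_threshold(loss, gamma):
    """Largest Bellman residual that still guarantees the given policy loss."""
    return loss * (1 - gamma) / (2 * gamma)

print(residual_threshold(0.01, 0.5))   # 0.005
print(residual_threshold(0.01, 0.9))   # ~0.00056: much tighter for high gamma
```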

SLIDE 9

Stopping



maxUpdate > loss(1 − γ)/(2γ)

[plot: the factor (1 − γ)/(2γ) as a function of γ, labeled (1-x)/(2*x); it grows without bound as γ → 0 and falls toward 0 as γ → 1]

SLIDE 10

Prioritized Sweeping



Concentrate updates on states whose value changes!

To update state s with change δ in U(s):
    update U(s)
    priority of s ← 0
    for each predecessor s′ of s:
        priority of s′ ← max(current priority, max_a δ · T̂(s′, a, s))
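A sketch with a binary heap; the dict-of-dicts model T[s][a][s2], the predecessor map pred[s], and the tolerance for stale heap entries are my own choices:

```python
import heapq

def prioritized_sweeping(U, T, R, pred, gamma, s_start, n_updates=1000):
    # pred[s] = iterable of (s_prev, a) pairs that can transition into s
    pq = [(-float("inf"), s_start)]        # max-heap via negated priorities
    for _ in range(n_updates):
        if not pq:
            break
        _, s = heapq.heappop(pq)           # stale duplicates just re-update s
        old = U[s]
        U[s] = R[s] + gamma * max(
            sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s])
        delta = abs(U[s] - old)
        for s_prev, a in pred[s]:
            priority = delta * T[s_prev][a].get(s, 0.0)
            if priority > 0:
                heapq.heappush(pq, (-priority, s_prev))
    return U
```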

SLIDE 11

Stochastic Shortest Path Problems



minimize sum of action costs

all action costs ≥ 0

non-empty set of (absorbing zero-cost) goal states

there exists at least one proper policy

proper policy: eventually brings the agent to a goal from any state, with probability 1

SLIDE 12

Real-time Dynamic Programming (RTDP)



Which states to update? Those the agent is likely to visit under the current policy:

initialize U to an upper bound
do trials until happy:
    s ← s0
    until at a goal:
        a, u_a ← the argmin and min over a of c(s, a) + Σ_{s′} T(s, a, s′) U(s′)
        U(s) ← u_a
        s ← pick among s′ weighted by T(s, a, s′)

Nice anytime profile. In practice, also do updates backward from the end of each trajectory. Convergence guaranteed by optimism.
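One RTDP trial might look like the following sketch (SSP convention, with assumed cost[s][a] and dict-of-dicts T; the final loop is the backward pass along the trajectory mentioned above):

```python
import random

def rtdp_trial(U, T, cost, s0, goals):
    """One RTDP trial: greedy walk from s0 to a goal, updating U en route."""
    s, trajectory = s0, []
    while s not in goals:
        q = {a: cost[s][a] + sum(p * U[s2] for s2, p in T[s][a].items())
             for a in T[s]}
        a = min(q, key=q.get)              # greedy one-step lookahead
        U[s] = q[a]                        # update only the visited state
        trajectory.append(s)
        succs, probs = zip(*T[s][a].items())
        s = random.choices(succs, weights=probs)[0]
    for s in reversed(trajectory):         # backward pass along trajectory
        U[s] = min(cost[s][a] + sum(p * U[s2] for s2, p in T[s][a].items())
                   for a in T[s])
    return U
```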

SLIDE 13

Upper Confidence Bounds on Trees (UCT, 2006)



on-line action selection

Monte Carlo tree search (MCTS): descent, roll-out, update, growth

W(s, a) = total reward
N(s, a) = number of times a tried in s
N(s) = number of times s visited

Z(s, a) = W(s, a)/N(s, a) + C √( log N(s) / N(s, a) )

roll-out policy; add one node after each roll-out

consistent!
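A minimal sketch of the descent rule (the exploration constant C and the stats layout are assumptions):

```python
import math

def ucb_score(W, N_sa, N_s, C=1.4):
    """Z(s,a) = W/N(s,a) + C * sqrt(log N(s) / N(s,a)); untried actions first."""
    if N_sa == 0:
        return float("inf")
    return W / N_sa + C * math.sqrt(math.log(N_s) / N_sa)

def descend(stats, N_s):
    # stats: dict mapping action -> (W, N_sa)
    return max(stats, key=lambda a: ucb_score(*stats[a], N_s))
```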

SLIDE 14

Break



asst 9

project: reading, prep, final proposal

wildcard topic

SLIDE 15

Policy Iteration



repeat until π doesn’t change:
    given π, compute U^π(s) for all states
    given U, calculate a new policy by one-step look-ahead

If π doesn’t change, U doesn’t either. We are at an equilibrium (= optimal π)!
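A sketch using the earlier array layout; the inner loop evaluates π with a few simplified VI sweeps (i.e. modified policy iteration), and the exact O(N³) evaluation appears after the policy-evaluation slides:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9, eval_sweeps=20):
    n = R.shape[0]
    pi = np.zeros(n, dtype=int)
    while True:
        U = np.zeros(n)
        for _ in range(eval_sweeps):       # a few simplified VI sweeps
            U = R + gamma * T[pi, np.arange(n)] @ U
        pi_new = np.argmax(T @ U, axis=0)  # one-step look-ahead
        if np.array_equal(pi_new, pi):
            return pi, U
        pi = pi_new
```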

SLIDE 16-18

Policy Evaluation

computing U^π(s):

U^π(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) U^π(s′)

Either solve this linear system exactly with linear algebra (O(N³)), or use simplified value iteration:

do a few times:
    U_{i+1}(s) ← R(s) + γ Σ_{s′} T(s, π(s), s′) U_i(s′)

(simplified because we are given π: no max over a)
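The exact route as a sketch, solving (I − γT_π)U = R with the array layout used earlier:

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma=0.9):
    """Solve (I - gamma * T_pi) U = R exactly; O(N^3)."""
    n = R.shape[0]
    T_pi = T[pi, np.arange(n)]      # T_pi[s, s2] = T(s, pi(s), s2)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)
```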

SLIDE 19

Summary of MDP Solving



value iteration: compute U^{π∗}
■ prioritized sweeping
■ RTDP

policy iteration: compute U^π using
■ linear algebra
■ simplified value iteration
■ a few updates (modified PI)

SLIDE 20

MDP Extras

■ ADP
■ Bandits
■ Q-Learning
■ RL Summary
■ Approx U
■ Deep RL
■ EOLQs


SLIDE 21

Adaptive Dynamic Programming



‘model-based’; active vs. passive

Learn T and R as we go, calculating π using MDP methods (eg, VI or PI).

Example with VI:

until max-update is small enough:
    for each state s:
        U(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

π(s) = argmax_a Σ_{s′} T(s, a, s′) U(s′)
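The model-learning half of ADP can be as simple as counting; a sketch (the class and method names are mine):

```python
from collections import defaultdict

class ModelEstimate:
    """Maximum-likelihood estimates of T and R from observed transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> s2 -> n
        self.R = {}

    def observe(self, s, a, s2, r):
        self.counts[(s, a)][s2] += 1
        self.R[s2] = r                     # reward for landing in s2

    def T(self, s, a):
        c = self.counts[(s, a)]
        total = sum(c.values())
        return {s2: k / total for s2, k in c.items()}
```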

SLIDE 22-23

Exploration vs Exploitation

problem: greedy action selection can get stuck in local minima

the ‘multi-armed bandit’ problem

take a random action with probability 1/N

or even something like

U⁺(s) ← R(s) + γ max_a f( Σ_{s′} T(s, a, s′) U⁺(s′), N(a, s) )

where f(u, n) = Rmax if n < k, and u otherwise
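As a sketch, the exploration function and the 1/N random-action rule (Rmax, the threshold k, and the function names are assumptions):

```python
import random

R_MAX, K = 1.0, 5    # assumed optimistic value and try-count threshold

def f(u, n):
    """Optimistic until (s, a) has been tried at least K times."""
    return R_MAX if n < K else u

def act(greedy_a, actions, n_visits):
    # random action with probability 1/N, else the greedy choice
    return random.choice(actions) if random.random() < 1 / n_visits else greedy_a
```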
SLIDE 24-29

Q-Learning

U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

Q(s, a) = γ Σ_{s′} T(s, a, s′) ( R(s′) + max_{a′} Q(s′, a′) )

Given experience s, a, s′, r:

Q(s, a) ← Q(s, a) + α(error)
Q(s, a) ← Q(s, a) + α(sensed − predicted)
Q(s, a) ← Q(s, a) + α( γ(r + max_{a′} Q(s′, a′)) − Q(s, a) )

α ≈ 1/N? policy: choose random with probability 1/N?
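A tabular sketch following the slide’s convention that r is received on landing in s′ (hence the γ(r + max_{a′} Q(s′, a′)) target), with α = 1/N as suggested:

```python
from collections import defaultdict

Q = defaultdict(float)   # (s, a) -> estimated value
N = defaultdict(int)     # (s, a) -> visit count

def q_update(s, a, s2, r, actions, gamma=0.9):
    N[(s, a)] += 1
    alpha = 1 / N[(s, a)]                                  # alpha ~ 1/N
    target = gamma * (r + max(Q[(s2, a2)] for a2 in actions))
    Q[(s, a)] += alpha * (target - Q[(s, a)])              # TD error step
```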

SLIDE 30

RL Summary



Model known (solving the MDP):
■ value iteration
■ policy iteration: compute U^π using
    linear algebra
    simplified value iteration
    a few updates (modified PI)

Model unknown (RL):
■ ADP using
    value iteration
    a few updates (eg, prioritized sweeping)
■ Q-learning

SLIDE 31-36

Function Approximation

Û(s) = θ₀ f₀(s) + θ₁ f₁(s) + θ₂ f₂(s) + ...

eg, Û(x, y) = θ₀ + θ₁x + θ₂y

Given a sample u at s = (x, y), we want an update that decreases the error

E = (Û(s) − u)² / 2

Gradient descent on each weight:

θᵢ ← θᵢ − α ∂E/∂θᵢ

θᵢ ← θᵢ − α (Û(s) − u) ∂Û(s)/∂θᵢ

In other words, the updates are:

θ₀ ← θ₀ − α(Û(s) − u)
θ₁ ← θ₁ − α(Û(s) − u)x
θ₂ ← θ₂ − α(Û(s) − u)y
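The same updates as a sketch (the feature vector follows the slide; α is a step size):

```python
import numpy as np

def features(s):
    x, y = s
    return np.array([1.0, x, y])    # f0 = 1, f1 = x, f2 = y, as on the slide

def fit_step(theta, s, u, alpha=0.1):
    phi = features(s)
    error = theta @ phi - u         # U_hat(s) - u
    return theta - alpha * error * phi
```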

SLIDE 37

Deep RL



How to choose features? Learn them!

deep Q-learning (DQN): eg, backgammon, Atari games; use mini-batches to try to avoid divergence

value approximation: eg, Go: predict the game outcome, and also the move probability

SLIDE 38

EOLQs



What question didn’t you get to ask today?

What’s still confusing?

What would you like to hear more about?

Please write down your most pressing question about AI and put it in the box on your way out. Thanks!