SLIDE 1

ARTIFICIAL INTELLIGENCE
Markov decision processes

INFOB2KI 2019-2020
Utrecht University, The Netherlands
Lecturer: Silja Renooij

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

SLIDE 2

PageRank (Google)

PageRank can be understood as:
a) A Markov Chain
b) A Markov Decision Process
c) A Partially Observable Markov Decision Process
d) None of the above

SLIDE 3

Markov models

Markov model = a stochastic model that assumes the Markov property.

  • stochastic model: models a process where the state depends on previous states in a non‐deterministic way.
  • Markov property: the probability distribution of future states, conditioned on both past and present values, depends only upon the present state: “given the present, the future does not depend on the past”.

Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.

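To make the Markov property concrete, here is a minimal sketch (not from the slides; the two-state chain and its numbers are invented for illustration) of sampling a Markov chain in Python. The next state is drawn from a distribution that depends only on the present state, never on the earlier history.

```python
import random

# Illustrative two-state chain (states and probabilities invented for this
# sketch); each row gives P(next state | current state) and sums to 1.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state from P(. | state): only the present state matters."""
    successors, probs = zip(*P[state].items())
    return random.choices(successors, weights=probs)[0]

state, trajectory = "sunny", ["sunny"]
for _ in range(10):
    state = step(state)   # the history before `state` is never consulted
    trajectory.append(state)
print(trajectory)
```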
SLIDE 4

Markov model types

                       Prediction            Planning (typically for optimisation purposes)
Fully observable       Markov chain          MDP (Markov decision process)
Partially observable   Hidden Markov model   POMDP (partially observable Markov decision process)

Prediction models can be represented at variable level by a (Dynamic) Bayesian network:

[Diagram: Markov chain S1 → S2 → S3 → …; the hidden Markov model additionally has observations O1, O2, O3, … attached to the states.]

SLIDE 5

PageRank (Google)

PageRank can be understood as:
a) A Markov Chain
b) A Markov Decision Process
c) A Partially Observable Markov Decision Process
d) None of the above

SLIDE 6

MDPs: outline

  • Search in non‐deterministic environments
  • Solution: an optimal policy (plan) of actions that maximizes rewards (decision‐theoretic planning)
  • Algorithm:
    – Bellman equation
    – value iteration
  • Link with learning

SLIDE 7

Running example: Grid World

  • A maze‐like problem
  • The agent lives in a grid, where walls block the agent’s path
  • Noisy movement: actions do not always go as planned (see the sketch after this list)
    • If there is a wall in the chosen direction, the agent stays put
    • 80% of the time, the action North takes the agent North
    • 10% of the time, North takes the agent West; 10% East (same deviation for other actions)
  • The agent receives rewards each “time” step
    • Small “living” reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
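A minimal sketch of this noisy action model, assuming grid cells as (x, y) pairs; the helper names below are mine, not from the slides:

```python
# Sketch of Grid World's noisy movement (helper names and (x, y) cell
# coordinates are illustrative assumptions, not from the slides).
LEFT_OF  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}
MOVE     = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def outcome_distribution(action):
    """The chosen action succeeds with p=0.8 and slips sideways with p=0.1 each."""
    return [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]

def next_cell(cell, direction, walls, width, height):
    """Move one step in `direction`; if a wall or the grid edge blocks it, stay put."""
    dx, dy = MOVE[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    blocked = nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height)
    return cell if blocked else nxt
```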
SLIDE 8

Search in non-deterministic environments: Grid World example

[Figure: Deterministic Grid World vs. Stochastic Grid World. Noisy movement: actions do not always go as planned.]

SLIDE 9

MDP: in search of a ‘plan’ that maximises reward

Each MDP state projects a search tree:

[Diagram: from state s, action a leads to next state s’; the triple (s,a,s’) is called a transition, with probability T(s,a,s’) = P(s’ | s, a) and reward R(s,a,s’).]

SLIDE 10

Goals, rewards and optimality criteria

  • Planning goals: encoded in the reward function
    Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero‐reward absorbing state, and assigning all other states negative reward.
  • Rewards: additive and time‐separable
  • Transitions: effect is uncertain
  • Objective: maximize expected total reward; future rewards may be discounted
  • Planning horizon: finite, infinite or indefinite (the latter is a special case of infinite, guaranteed to reach a terminal state)

SLIDE 11

Markov Decision Processes

MDPs are non‐deterministic search problems. An MDP is defined by:

  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s’)
    – Probability that a from s leads to s’, i.e., P(s’ | s, a)
    – Also called the model or the dynamics
  • A reward function R(s, a, s’)
    – Sometimes just R(s) or R(s’)
  • A start state
  • Sometimes a terminal state

(A minimal sketch of this definition as a data structure follows below.)

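As a sketch, the five ingredients can be bundled in one container; the field names and types below are illustrative assumptions, not from the slides:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = Tuple[int, int]   # e.g. a Grid World cell; illustrative choice
Action = str              # e.g. "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]                  # actions available in s
    T: Callable[[State, Action], List[Tuple[State, float]]]   # [(s', P(s'|s,a)), ...]
    R: Callable[[State, Action, State], float]                # reward R(s, a, s')
    start: State
    terminals: FrozenSet[State] = frozenset()                 # optional terminal states
    gamma: float = 0.9                                        # discount (see later slides)
```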
SLIDE 12

What is Markov about MDPs?

  • Recall: “Markov” generally means that given the present state, the future and the past are independent
  • For Markov decision processes, “Markov” means action outcomes depend only on the current state
  • This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856‐1922)

SLIDE 13

Running example: Grid World

  • A maze‐like problem
  • The agent lives in a grid, where walls block the agent’s path
  • Noisy movement: actions do not always go as planned
    • If there is a wall in the chosen direction, the agent stays put
    • 80% of the time, the action North takes the agent North
    • 10% of the time, North takes the agent West; 10% East (same deviation for other actions)
  • The agent receives rewards each “time” step
    • Small “living” reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
SLIDE 14

Policies

Example: optimal policy when R(s, a, s’) = ‐0.03 for all non‐terminal states s

  • Recall: deterministic search ⇒ an optimal plan: a sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy π*: S → A
    – A policy π gives an action for each state
    – An optimal policy is one that maximizes expected utility (reward) if followed
    Note: an explicit policy defines a reflex agent

SLIDE 15

Optimal Policies - examples

[Figure: four Grid World panels with living reward R(s) = ‐2.0, ‐0.4, ‐0.03 and ‐0.01, and terminal rewards +1 and ‐1.]

Visualisation: each cell represents the state in which the robot occupies that cell; an arrow indicates the optimal action in the given state.

SLIDE 16

Utilities of reward sequences

  • What preferences should an agent have over reward sequences?
    • More or less? [1, 2, 2] or [2, 3, 4]
    • Now or later? [0, 0, 1] or [1, 0, 0]
  • It’s reasonable to maximize the sum of rewards
  • It’s also reasonable to prefer rewards now to rewards later
  • A solution: values of rewards decay exponentially

SLIDE 17

Discounting

[Figure: a reward is worth its full value now, a discounted fraction one step later, and less still in two steps.]

SLIDE 18

Discounting: implementation

  • How to discount?
    – Each time we descend a level, we multiply in the discount once
  • Why discount?
    – Sooner rewards probably do have higher utility than later rewards
    – Also helps our algorithms converge
  • Example:
    – Value of receiving [1, 2, 3] with a discount of 0.5 = 1·1 + 0.5·2 + 0.25·3 = 2.75
    – which is less than the value of [3, 2, 1] (= 3·1 + 0.5·2 + 0.25·1 = 4.25; verified in the sketch below)

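The same computation as a tiny Python helper (the function name is mine), reproducing the slide’s arithmetic:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r over a reward sequence [r_0, r_1, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

assert discounted_return([1, 2, 3], 0.5) == 1*1 + 0.5*2 + 0.25*3   # = 2.75
assert discounted_return([3, 2, 1], 0.5) == 4.25                   # larger
```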
SLIDE 19
Returns

  • Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
    ⇒ the return is the total reward from ‘time’ t to ‘time’ T, ending an episode:

    $$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$

  • Continuing tasks: interaction does not break naturally into episodes.
    ⇒ the return is discounted by a discount rate γ, 0 ≤ γ ≤ 1:

    $$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$$

  • Return: reward in the long run; γ close to 0 is shortsighted, γ close to 1 is farsighted.

SLIDE 20

Solving MDPs

SLIDE 21

Optimal quantities

  • State s has value V(s):
    V^π(s) = expected reward from s when acting according to policy π
    V*(s) = expected reward starting in s and thereafter acting optimally
  • The optimal policy:
    π*(s) = optimal action from state s
    Important property: the policy π that is greedy with respect to V* is the optimal policy π*

(For later: the intermediate q‐state (s,a) has value Q(s,a))

[Diagram: from state s, taking action a reaches q‐state (s,a); transition (s,a,s’) then leads to state s’.]
SLIDE 22

Characterizing any V^π(s)

What is the value of following policy π when in state s_t?

First, consider the deterministic situation:

$$V^\pi(s_t) = E[R_t] = E\Big[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}\Big] = r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+2} = R(s_t, \pi(s_t), s_{t+1}) + \gamma\, V^\pi(s_{t+1})$$

Noise ⇒ take the expected value over all possible next states:

$$V^\pi(s_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, \pi(s_t))\,\big[R(s_t, \pi(s_t), s_{t+1}) + \gamma\, V^\pi(s_{t+1})\big]$$

where P(·) is given by the transition function T(·).

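A sketch of evaluating a fixed policy by sweeping this recursive equation until the values stop changing. The dict-based layout of T and R is an assumption of mine (matching the earlier MDP sketch), not the slides’:

```python
def evaluate_policy(states, policy, T, R, gamma, tol=1e-6):
    """Iterate V(s) <- sum_{s'} P(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')].

    Assumed layout (mine, not the slides'): T[(s, a)] is a list of
    (s', probability) pairs and R[(s, a, s2)] is a number.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:          # values stopped changing: V ~= V^pi
            return V
```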
SLIDE 23

Example: Policy Evaluation

[Figure: two policies, π: Always Go Right and π: Always Go Forward; V^π is shown for each state (indicated in the cell); terminal states marked.]

SLIDE 24

Characterizing optimal V*(s)

Expected reward from state s is maximized by acting optimally in s and thereafter
⇒ the optimal value for a state is obtained when following the optimal policy π*:

$$V^*(s) = \max_{\pi} V^\pi(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^*(s')\big] = \max_a Q^*(s, a)$$

This equation is called the Bellman equation.

SLIDE 25

Using V*(s) to obtain π*(s)

The optimal policy can be extracted from V*(s) using one‐step look‐ahead, i.e.

  • use the Bellman equation once more to compute the given summation for all actions
  • rather than returning the max value, return the action that gives the max value

$$\pi^*(s) = \arg\max_a Q^*(s,a) = \arg\max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^*(s')\big]$$

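A sketch of this extraction step, under the same assumed dict layout as the earlier sketches: identical to a Bellman backup except that it returns the arg max action instead of the max value:

```python
def extract_policy(states, actions, T, R, gamma, V):
    """One-step look-ahead on V: per state, return the arg max action, not the max."""
    return {
        s: max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)]),
        )
        for s in states
    }
```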
SLIDE 26

Back to Gridworld: Noise = 0.2

(i.e. move successful with p=0.8; deviation to left/right both with p=0.1)

Discount γ = 0.9. Living reward R(s,a,s’) = 0. Optimal policy?

Given V* (shown in cells), one‐step look‐ahead produces the long‐term optimal actions (shown as small arrowheads).

Using V*(s) to obtain π*(s)

SLIDE 27

Using V*(s) to obtain π*(s): example

For a state s with north neighbour s_n and east neighbour s_e (the cell borders walls on the west and south, so blocked moves stay put):

π*(s)?
a = north:  s’ = s_n: 0.8·[0 + 0.9·0.57] = 0.4104
            s’ = s:   0.1·[0 + 0.9·0.49] = 0.0441
            s’ = s_e: 0.1·[0 + 0.9·0.43] = 0.0387
                                     sum = 0.4932
a = west:   s’ = s:   0.8·[0 + 0.9·0.49] = 0.3528
            s’ = s_n: 0.1·[0 + 0.9·0.57] = 0.0513
            s’ = s:   0.1·[0 + 0.9·0.49] = 0.0441
                                     sum = 0.4482
Same for a = south and a = east.

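A quick arithmetic check of the two Q-values above (the V* values 0.57, 0.49 and 0.43 are read off the slide’s grid; everything else is plain arithmetic):

```python
gamma = 0.9
# V* of the reachable cells, from the slide: 0.57 (north neighbour),
# 0.49 (staying put against a wall), 0.43 (east neighbour).
q_north = 0.8 * (0 + gamma * 0.57) + 0.1 * (0 + gamma * 0.49) + 0.1 * (0 + gamma * 0.43)
q_west  = 0.8 * (0 + gamma * 0.49) + 0.1 * (0 + gamma * 0.57) + 0.1 * (0 + gamma * 0.49)
print(round(q_north, 4), round(q_west, 4))   # 0.4932 0.4482 -> north is better
```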
SLIDE 28

Value Iteration

A Dynamic Programming algorithm for computing V*

SLIDE 29

Value Iteration (VI)

‘Tree backup’: define V_k(s) as the optimal value of s still to be obtained if ‘the game’ ends in k more time steps.

  • Start with V_0(s) = 0 for all s (including terminal s)
    ⇒ terminal state rewards (if any) are typically added at k=1
  • Given the V_k(s) values, compute for each s:

    $$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_k(s')\big] \quad \big(= \max_a Q_{k+1}(s, a)\big)$$

  • Repeat until convergence of the V‐values

Theorem: VI will converge to unique optimal values
– Basic idea: the approximations get refined towards the optimal values

[Diagram: one‐step backup: V_{k+1}(s) at the root, actions a below it, successor values V_k(s’) at the leaves.]

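A sketch of the full loop, again assuming the dict-based T and R layout from the earlier sketches; tol is an invented stopping threshold standing in for “convergence of V-values”:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Repeat the Bellman update V_{k+1}(s) = max_a sum_{s'} T(..)(R + gamma V_k)."""
    V = {s: 0.0 for s in states}                      # V_0 = 0 for all s
    while True:
        V_next = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
            for s in states
        }
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next                             # converged: V ~= V*
        V = V_next
```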
SLIDE 30

k=0

Noise = 0.2, Discount γ = 0.9, Living reward R = 0

VI init: V_0(s) = 0 for all s.

Policy shown:
  • based on one‐step look‐ahead, i.e. the actions that give the max V_{k+1}‐values
  • not used in computing the V‐values!
  • not yet interesting (only shown to demonstrate changes)
  • default policy: N

SLIDE 31

k=1

Implementation of a terminal state ‘e’ with reward ‘r’:

  • 1 action: ‘x’ (exit)
  • T(e, x) = 1
  • R(e, x) = r

At k=1, only the exit action from terminal states results in a change in V‐values.

Noise = 0.2, Discount γ = 0.9, Living reward R = 0

SLIDE 32

k=2

Noise = 0.2, Discount γ = 0.9, Living reward = 0

V(s) only changes for those states s in which an action can result in a transition to a state s’ with V(s’) ≠ 0.

Let s: robot in (3,3). Then V_2(s):
a = north: 0.8·[0 + 0.9·0] + 0.1·[0 + 0.9·0] + 0.1·[0 + 0.9·1] = 0.09
a = west:  0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·0 = 0
a = south: 0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·1 = 0.09
a = east:  0.8·0.9·1 + 0.1·0.9·0 + 0.1·0.9·0 = 0.72
⇒ V_2(s) = max = 0.72

SLIDE 33

k=3

Noise = 0.2, Discount γ = 0.9, Living reward = 0

Let s: robot in (3,3). Recall V_2(s) = 0.72; now V_3(s):
a = north: 0.8·[0 + 0.9·0.72] + 0.1·[0 + 0.9·0] + 0.1·[0 + 0.9·1] = 0.6084
a = west:  0.8·0.9·0 + 0.1·0.9·0.72 + 0.1·0.9·0 = 0.0648
a = south: 0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·1 = 0.09
a = east:  0.8·0.9·1 + 0.1·0.9·0.72 + 0.1·0.9·0 = 0.7848
⇒ V_3(s) = max = 0.7848

SLIDE 34

k=4

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 35

k=5

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 36

k=6

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 37

k=7

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 38

k=8

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 39

k=9

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 40

k=10

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 41

k=100

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 42

Problems with Value Iteration

  • Value iteration repeats the Bellman updates:

    $$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_k(s')\big]$$

  • Problem 1: It’s slow – O(S²A) per iteration
  • Problem 2: The “max” at each state rarely changes
  • Problem 3: The policy often converges long before the values

SLIDE 43

Policy Iteration

  • Alternative approach for computing optimal values:
    – Step 1: Policy evaluation: calculate returns for some fixed policy until convergence
    – Step 2: Policy improvement: update the policy using one‐step look‐ahead, with the resulting converged (but not optimal!) returns as future values
    – Repeat both steps until the policy converges
  • This is policy iteration (sketched below)
    – It’s still optimal!
    – Can converge (much) faster under some conditions

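A sketch of this alternation, reusing evaluate_policy and extract_policy from the earlier sketches (so this fragment assumes those definitions are in scope):

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}       # arbitrary initial policy
    while True:
        V = evaluate_policy(states, policy, T, R, gamma)            # step 1
        improved = extract_policy(states, actions, T, R, gamma, V)  # step 2
        if improved == policy:                        # no change: policy converged
            return policy, V
        policy = improved
```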
SLIDE 44

Recall: Policy Evaluation

[Figure: two policies, π: Always Go Right and π: Always Go Forward; V^π is shown for each state.]

SLIDE 45

Comparison VI and PI

  • Both are DP algorithms for solving MDPs
  • Both compute the same thing (all optimal values)
  • In value iteration:
    – Every iteration updates both the values and, implicitly, the policy
    – We don’t track the policy: taking the max over actions implicitly recomputes it
  • In policy iteration:
    – Do several passes that update returns with a fixed policy (each pass is fast: we consider only one action, not all)
    – After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    – The new policy will be better (or we’re done)

SLIDE 46

Double Bandits

SLIDE 47

Double-Bandit MDP

  • Actions: play Blue or Red
  • States: Win (W), Lose (L)

[Diagram: Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 and $0 with probability 0.25.]

Representation: states with transitions, including probability and reward.

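One plausible encoding of this MDP as transition tables. The payoff probabilities are from the diagram; the successor states below are an assumption (the rewards here do not depend on the successor, so any choice yields the same values):

```python
# T[(state, action)] lists (next_state, probability, reward); the successor
# states are an assumption of this sketch, the probabilities and payoffs
# are the slide's.
T = {
    ("W", "blue"): [("W", 1.0, 1)],
    ("L", "blue"): [("W", 1.0, 1)],
    ("W", "red"):  [("W", 0.75, 2), ("L", 0.25, 0)],
    ("L", "red"):  [("W", 0.75, 2), ("L", 0.25, 0)],
}

# Expected immediate reward per play: identical in W and L, so with no
# discount over 100 steps both states indeed have the same value.
ev = {a: sum(p * r for _, p, r in T[("W", a)]) for a in ("blue", "red")}
print(ev)   # {'blue': 1.0, 'red': 1.5}
```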
SLIDE 48

Offline Planning

  • Solving MDPs is offline planning
    – You determine all quantities through computation
    – You need to know the details of the MDP
    – You do not actually play the game!

No discount, 100 time steps: both states have the same value.

SLIDE 49

Let’s Play!

$2 $2 $0 $2 $2   $2 $2 $0 $0 $0

SLIDE 50

Online Planning

  • Rules changed! Red’s win chance is different.

[Diagram: as before, but Red’s probabilities are now unknown: $2 with probability ?? and $0 with probability ??; Blue still pays $1 with probability 1.0.]

SLIDE 51

Let’s Play again!

$0 $0 $0 $2 $0   $2 $0 $0 $0 $0

SLIDE 52

What Just Happened?

  • That wasn’t planning, it was learning!
    – Specifically, reinforcement learning
    – There was an MDP, but you couldn’t solve it with just computation
    – You needed to actually act to figure it out