SLIDE 1

ARTIFICIAL INTELLIGENCE
Markov decision processes

INFOB2KI 2019-2020
Utrecht University, The Netherlands
Lecturer: Silja Renooij

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

SLIDE 2

PageRank (Google)

PageRank can be understood as:
a) A Markov Chain
b) A Markov Decision Process
c) A Partially Observable Markov Decision Process
d) None of the above

SLIDE 3

Markov models

Markov model = a stochastic model that assumes the Markov property.

  • stochastic model: models a process where the state depends on previous states in a non‐deterministic way.
  • Markov property: the probability distribution of future states, conditioned on both past and present values, depends only upon the present state: “given the present, the future does not depend on the past”.

Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.

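To make the Markov property concrete, here is a minimal sketch (not from the slides; the two-state chain and its numbers are invented for illustration) of sampling a Markov chain in Python. The next state is drawn from a distribution that depends only on the present state, never on the earlier history.

```python
import random

# Illustrative two-state chain (states and probabilities invented for this
# sketch); each row gives P(next state | current state) and sums to 1.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state from P(. | state): only the present state matters."""
    successors, probs = zip(*P[state].items())
    return random.choices(successors, weights=probs)[0]

state, trajectory = "sunny", ["sunny"]
for _ in range(10):
    state = step(state)   # the history before `state` is never consulted
    trajectory.append(state)
print(trajectory)
```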
SLIDE 4

Markov model types

                       Prediction            Planning (typically for optimisation purposes)
Fully observable       Markov chain          MDP (Markov decision process)
Partially observable   Hidden Markov model   POMDP (partially observable Markov decision process)

Prediction models can be represented at variable level by a (Dynamic) Bayesian network:

[Diagram: Markov chain S1 → S2 → S3 → …; the hidden Markov model additionally has observations O1, O2, O3, … attached to the states.]

SLIDE 5

PageRank (Google)

PageRank can be understood as:
a) A Markov Chain
b) A Markov Decision Process
c) A Partially Observable Markov Decision Process
d) None of the above

SLIDE 6

MDPs: outline

  • Search in non‐deterministic environments
  • Solution: an optimal policy (plan) of actions that maximizes rewards (decision‐theoretic planning)
  • Algorithm:
    – Bellman equation
    – value iteration
  • Link with learning

SLIDE 7

Running example: Grid World

  • A maze‐like problem
  • The agent lives in a grid, where walls block the agent’s path
  • Noisy movement: actions do not always go as planned (see the sketch after this list)
    • If there is a wall in the chosen direction, the agent stays put
    • 80% of the time, the action North takes the agent North
    • 10% of the time, North takes the agent West; 10% East (same deviation for other actions)
  • The agent receives rewards each “time” step
    • Small “living” reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
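A minimal sketch of this noisy action model, assuming grid cells as (x, y) pairs; the helper names below are mine, not from the slides:

```python
# Sketch of Grid World's noisy movement (helper names and (x, y) cell
# coordinates are illustrative assumptions, not from the slides).
LEFT_OF  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}
MOVE     = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def outcome_distribution(action):
    """The chosen action succeeds with p=0.8 and slips sideways with p=0.1 each."""
    return [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]

def next_cell(cell, direction, walls, width, height):
    """Move one step in `direction`; if a wall or the grid edge blocks it, stay put."""
    dx, dy = MOVE[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    blocked = nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height)
    return cell if blocked else nxt
```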
SLIDE 8

Search in non-deterministic environments: Grid World example

[Figure: Deterministic Grid World vs. Stochastic Grid World. Noisy movement: actions do not always go as planned.]

SLIDE 9

MDP: in search of a ‘plan’ that maximises reward

Each MDP state projects a search tree:

[Diagram: from state s, action a leads to next state s’; the triple (s,a,s’) is called a transition, with probability T(s,a,s’) = P(s’ | s, a) and reward R(s,a,s’).]

SLIDE 10

Goals, rewards and optimality criteria

  • Planning goals: encoded in the reward function
    Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero‐reward absorbing state, and assigning all other states negative reward.
  • Rewards: additive and time‐separable
  • Transitions: effect is uncertain
  • Objective: maximize expected total reward; future rewards may be discounted
  • Planning horizon: finite, infinite or indefinite (the latter is a special case of infinite, guaranteed to reach a terminal state)

SLIDE 11

Markov Decision Processes

MDPs are non‐deterministic search problems. An MDP is defined by:

  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s’)
    – Probability that a from s leads to s’, i.e., P(s’ | s, a)
    – Also called the model or the dynamics
  • A reward function R(s, a, s’)
    – Sometimes just R(s) or R(s’)
  • A start state
  • Sometimes a terminal state

(A minimal sketch of this definition as a data structure follows below.)

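As a sketch, the five ingredients can be bundled in one container; the field names and types below are illustrative assumptions, not from the slides:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = Tuple[int, int]   # e.g. a Grid World cell; illustrative choice
Action = str              # e.g. "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]                  # actions available in s
    T: Callable[[State, Action], List[Tuple[State, float]]]   # [(s', P(s'|s,a)), ...]
    R: Callable[[State, Action, State], float]                # reward R(s, a, s')
    start: State
    terminals: FrozenSet[State] = frozenset()                 # optional terminal states
    gamma: float = 0.9                                        # discount (see later slides)
```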
SLIDE 12

What is Markov about MDPs?

  • Recall: “Markov” generally means that given the present state, the future and the past are independent
  • For Markov decision processes, “Markov” means action outcomes depend only on the current state
  • This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856‐1922)

SLIDE 13

Running example: Grid World

  • A maze‐like problem
  • The agent lives in a grid, where walls block the agent’s path
  • Noisy movement: actions do not always go as planned
    • If there is a wall in the chosen direction, the agent stays put
    • 80% of the time, the action North takes the agent North
    • 10% of the time, North takes the agent West; 10% East (same deviation for other actions)
  • The agent receives rewards each “time” step
    • Small “living” reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
SLIDE 14

Policies

Example: optimal policy when R(s, a, s’) = ‐0.03 for all non‐terminal states s

  • Recall: deterministic search ⇒ an optimal plan: a sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy π*: S → A
    – A policy π gives an action for each state
    – An optimal policy is one that maximizes expected utility (reward) if followed
    Note: an explicit policy defines a reflex agent

SLIDE 15

Optimal Policies - examples

[Figure: four Grid World panels with living reward R(s) = ‐2.0, ‐0.4, ‐0.03 and ‐0.01, and terminal rewards +1 and ‐1.]

Visualisation: each cell represents the state in which the robot occupies that cell; an arrow indicates the optimal action in the given state.

SLIDE 16

Utilities of reward sequences

  • What preferences should an agent have over reward sequences?
    • More or less? [1, 2, 2] or [2, 3, 4]
    • Now or later? [0, 0, 1] or [1, 0, 0]
  • It’s reasonable to maximize the sum of rewards
  • It’s also reasonable to prefer rewards now to rewards later
  • A solution: values of rewards decay exponentially

SLIDE 17

Discounting

[Figure: a reward is worth its full value now, a discounted fraction one step later, and less still in two steps.]

SLIDE 18

Discounting: implementation

  • How to discount?
    – Each time we descend a level, we multiply in the discount once
  • Why discount?
    – Sooner rewards probably do have higher utility than later rewards
    – Also helps our algorithms converge
  • Example:
    – Value of receiving [1, 2, 3] with a discount of 0.5 = 1·1 + 0.5·2 + 0.25·3 = 2.75
    – which is less than the value of [3, 2, 1] (= 3·1 + 0.5·2 + 0.25·1 = 4.25; verified in the sketch below)

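The same computation as a tiny Python helper (the function name is mine), reproducing the slide’s arithmetic:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r over a reward sequence [r_0, r_1, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

assert discounted_return([1, 2, 3], 0.5) == 1*1 + 0.5*2 + 0.25*3   # = 2.75
assert discounted_return([3, 2, 1], 0.5) == 4.25                   # larger
```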
SLIDE 19
Returns

  • Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
    ⇒ the return is the total reward from ‘time’ t to ‘time’ T, ending an episode:

    $$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$

  • Continuing tasks: interaction does not break naturally into episodes.
    ⇒ the return is discounted by a discount rate γ, 0 ≤ γ ≤ 1:

    $$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$$

  • Return: reward in the long run; γ close to 0 is shortsighted, γ close to 1 is farsighted.

SLIDE 20

Solving MDPs

SLIDE 21

Optimal quantities

  • State s has value V(s):
    V^π(s) = expected reward from s when acting according to policy π
    V*(s) = expected reward starting in s and thereafter acting optimally
  • The optimal policy:
    π*(s) = optimal action from state s
    Important property: the policy π that is greedy with respect to V* is the optimal policy π*

(For later: the intermediate q‐state (s,a) has value Q(s,a))

[Diagram: from state s, taking action a reaches q‐state (s,a); transition (s,a,s’) then leads to state s’.]
SLIDE 22

Characterizing any V^π(s)

What is the value of following policy π when in state s_t?

First, consider the deterministic situation:

$$V^\pi(s_t) = E[R_t] = E\Big[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}\Big] = r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+2} = R(s_t, \pi(s_t), s_{t+1}) + \gamma\, V^\pi(s_{t+1})$$

Noise ⇒ take the expected value over all possible next states:

$$V^\pi(s_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, \pi(s_t))\,\big[R(s_t, \pi(s_t), s_{t+1}) + \gamma\, V^\pi(s_{t+1})\big]$$

where P(·) is given by the transition function T(·).

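A sketch of evaluating a fixed policy by sweeping this recursive equation until the values stop changing. The dict-based layout of T and R is an assumption of mine (matching the earlier MDP sketch), not the slides’:

```python
def evaluate_policy(states, policy, T, R, gamma, tol=1e-6):
    """Iterate V(s) <- sum_{s'} P(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')].

    Assumed layout (mine, not the slides'): T[(s, a)] is a list of
    (s', probability) pairs and R[(s, a, s2)] is a number.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:          # values stopped changing: V ~= V^pi
            return V
```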
SLIDE 23

Example: Policy Evaluation

[Figure: two policies, π: Always Go Right and π: Always Go Forward; V^π is shown for each state (indicated in the cell); terminal states marked.]

SLIDE 24

Characterizing optimal V*(s)

Expected reward from state s is maximized by acting optimally in s and thereafter
⇒ the optimal value for a state is obtained when following the optimal policy π*:

$$V^*(s) = \max_{\pi} V^\pi(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^*(s')\big] = \max_a Q^*(s, a)$$

This equation is called the Bellman equation.

SLIDE 25

Using V*(s) to obtain π*(s)

The optimal policy can be extracted from V*(s) using one‐step look‐ahead, i.e.

  • use the Bellman equation once more to compute the given summation for all actions
  • rather than returning the max value, return the action that gives the max value

$$\pi^*(s) = \arg\max_a Q^*(s,a) = \arg\max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^*(s')\big]$$

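A sketch of this extraction step, under the same assumed dict layout as the earlier sketches: identical to a Bellman backup except that it returns the arg max action instead of the max value:

```python
def extract_policy(states, actions, T, R, gamma, V):
    """One-step look-ahead on V: per state, return the arg max action, not the max."""
    return {
        s: max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)]),
        )
        for s in states
    }
```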
SLIDE 26

Back to Gridworld: Noise = 0.2

(i.e. move successful with p=0.8; deviation to left/right both with p=0.1)

Discount γ = 0.9. Living reward R(s,a,s’) = 0. Optimal policy?

Given V* (shown in cells), one‐step look‐ahead produces the long‐term optimal actions (shown as small arrowheads).

Using V*(s) to obtain π*(s)

SLIDE 27

Using V*(s) to obtain π*(s): example

For a state s with north neighbour s_n and east neighbour s_e (the cell borders walls on the west and south, so blocked moves stay put):

π*(s)?
a = north:  s’ = s_n: 0.8·[0 + 0.9·0.57] = 0.4104
            s’ = s:   0.1·[0 + 0.9·0.49] = 0.0441
            s’ = s_e: 0.1·[0 + 0.9·0.43] = 0.0387
                                     sum = 0.4932
a = west:   s’ = s:   0.8·[0 + 0.9·0.49] = 0.3528
            s’ = s_n: 0.1·[0 + 0.9·0.57] = 0.0513
            s’ = s:   0.1·[0 + 0.9·0.49] = 0.0441
                                     sum = 0.4482
Same for a = south and a = east.

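A quick arithmetic check of the two Q-values above (the V* values 0.57, 0.49 and 0.43 are read off the slide’s grid; everything else is plain arithmetic):

```python
gamma = 0.9
# V* of the reachable cells, from the slide: 0.57 (north neighbour),
# 0.49 (staying put against a wall), 0.43 (east neighbour).
q_north = 0.8 * (0 + gamma * 0.57) + 0.1 * (0 + gamma * 0.49) + 0.1 * (0 + gamma * 0.43)
q_west  = 0.8 * (0 + gamma * 0.49) + 0.1 * (0 + gamma * 0.57) + 0.1 * (0 + gamma * 0.49)
print(round(q_north, 4), round(q_west, 4))   # 0.4932 0.4482 -> north is better
```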
SLIDE 28

Value Iteration

A Dynamic Programming algorithm for computing V*

SLIDE 29

Value Iteration (VI)

‘Tree backup’: define V_k(s) as the optimal value of s still to be obtained if ‘the game’ ends in k more time steps.

  • Start with V_0(s) = 0 for all s (including terminal s)
    ⇒ terminal state rewards (if any) are typically added at k=1
  • Given the V_k(s) values, compute for each s:

    $$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_k(s')\big] \quad \big(= \max_a Q_{k+1}(s, a)\big)$$

  • Repeat until convergence of the V‐values

Theorem: VI will converge to unique optimal values
– Basic idea: the approximations get refined towards the optimal values

[Diagram: one‐step backup: V_{k+1}(s) at the root, actions a below it, successor values V_k(s’) at the leaves.]

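A sketch of the full loop, again assuming the dict-based T and R layout from the earlier sketches; tol is an invented stopping threshold standing in for “convergence of V-values”:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Repeat the Bellman update V_{k+1}(s) = max_a sum_{s'} T(..)(R + gamma V_k)."""
    V = {s: 0.0 for s in states}                      # V_0 = 0 for all s
    while True:
        V_next = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
            for s in states
        }
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next                             # converged: V ~= V*
        V = V_next
```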
SLIDE 30

k=0

Noise = 0.2, Discount γ = 0.9, Living reward R = 0

VI init: V_0(s) = 0 for all s.

Policy shown:
  • based on one‐step look‐ahead, i.e. the actions that give the max V_{k+1}‐values
  • not used in computing the V‐values!
  • not yet interesting (only shown to demonstrate changes)
  • default policy: N

SLIDE 31

k=1

Implementation of a terminal state ‘e’ with reward ‘r’:

  • 1 action: ‘x’ (exit)
  • T(e, x) = 1
  • R(e, x) = r

At k=1, only the exit action from terminal states results in a change in V‐values.

Noise = 0.2, Discount γ = 0.9, Living reward R = 0

SLIDE 32

k=2

Noise = 0.2, Discount γ = 0.9, Living reward = 0

V(s) only changes for those states s in which an action can result in a transition to a state s’ with V(s’) ≠ 0.

Let s: robot in (3,3). Then V_2(s):
a = north: 0.8·[0 + 0.9·0] + 0.1·[0 + 0.9·0] + 0.1·[0 + 0.9·1] = 0.09
a = west:  0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·0 = 0
a = south: 0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·1 = 0.09
a = east:  0.8·0.9·1 + 0.1·0.9·0 + 0.1·0.9·0 = 0.72
⇒ V_2(s) = max = 0.72

SLIDE 33

k=3

Noise = 0.2, Discount γ = 0.9, Living reward = 0

Let s: robot in (3,3). Recall V_2(s) = 0.72; now V_3(s):
a = north: 0.8·[0 + 0.9·0.72] + 0.1·[0 + 0.9·0] + 0.1·[0 + 0.9·1] = 0.6084
a = west:  0.8·0.9·0 + 0.1·0.9·0.72 + 0.1·0.9·0 = 0.0648
a = south: 0.8·0.9·0 + 0.1·0.9·0 + 0.1·0.9·1 = 0.09
a = east:  0.8·0.9·1 + 0.1·0.9·0.72 + 0.1·0.9·0 = 0.7848
⇒ V_3(s) = max = 0.7848

SLIDE 34

k=4

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 35

k=5

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 36

k=6

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 37

k=7

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 38

k=8

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 39

k=9

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 40

k=10

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 41

k=100

Noise = 0.2, Discount γ = 0.9, Living reward = 0
SLIDE 42

Problems with Value Iteration

  • Value iteration repeats the Bellman updates:

    $$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_k(s')\big]$$

  • Problem 1: It’s slow – O(S²A) per iteration
  • Problem 2: The “max” at each state rarely changes
  • Problem 3: The policy often converges long before the values

SLIDE 43

Policy Iteration

  • Alternative approach for computing optimal values:
    – Step 1: Policy evaluation: calculate returns for some fixed policy until convergence
    – Step 2: Policy improvement: update the policy using one‐step look‐ahead, with the resulting converged (but not optimal!) returns as future values
    – Repeat both steps until the policy converges
  • This is policy iteration (sketched below)
    – It’s still optimal!
    – Can converge (much) faster under some conditions

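A sketch of this alternation, reusing evaluate_policy and extract_policy from the earlier sketches (so this fragment assumes those definitions are in scope):

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}       # arbitrary initial policy
    while True:
        V = evaluate_policy(states, policy, T, R, gamma)            # step 1
        improved = extract_policy(states, actions, T, R, gamma, V)  # step 2
        if improved == policy:                        # no change: policy converged
            return policy, V
        policy = improved
```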
SLIDE 44

Recall: Policy Evaluation

[Figure: two policies, π: Always Go Right and π: Always Go Forward; V^π is shown for each state.]

SLIDE 45

Comparison VI and PI

  • Both are DP algorithms for solving MDPs
  • Both compute the same thing (all optimal values)
  • In value iteration:
    – Every iteration updates both the values and, implicitly, the policy
    – We don’t track the policy: taking the max over actions implicitly recomputes it
  • In policy iteration:
    – Do several passes that update returns with a fixed policy (each pass is fast: we consider only one action, not all)
    – After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    – The new policy will be better (or we’re done)

SLIDE 46

Double Bandits

SLIDE 47

Double-Bandit MDP

  • Actions: play Blue or Red
  • States: Win (W), Lose (L)

[Diagram: Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 and $0 with probability 0.25.]

Representation: states with transitions, including probability and reward.

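One plausible encoding of this MDP as transition tables. The payoff probabilities are from the diagram; the successor states below are an assumption (the rewards here do not depend on the successor, so any choice yields the same values):

```python
# T[(state, action)] lists (next_state, probability, reward); the successor
# states are an assumption of this sketch, the probabilities and payoffs
# are the slide's.
T = {
    ("W", "blue"): [("W", 1.0, 1)],
    ("L", "blue"): [("W", 1.0, 1)],
    ("W", "red"):  [("W", 0.75, 2), ("L", 0.25, 0)],
    ("L", "red"):  [("W", 0.75, 2), ("L", 0.25, 0)],
}

# Expected immediate reward per play: identical in W and L, so with no
# discount over 100 steps both states indeed have the same value.
ev = {a: sum(p * r for _, p, r in T[("W", a)]) for a in ("blue", "red")}
print(ev)   # {'blue': 1.0, 'red': 1.5}
```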
SLIDE 48

Offline Planning

  • Solving MDPs is offline planning
    – You determine all quantities through computation
    – You need to know the details of the MDP
    – You do not actually play the game!

No discount, 100 time steps: both states have the same value.

SLIDE 49

Let’s Play!

$2 $2 $0 $2 $2   $2 $2 $0 $0 $0

SLIDE 50

Online Planning

  • Rules changed! Red’s win chance is different.

[Diagram: as before, but Red’s probabilities are now unknown: $2 with probability ?? and $0 with probability ??; Blue still pays $1 with probability 1.0.]

SLIDE 51

Let’s Play again!

$0 $0 $0 $2 $0   $2 $0 $0 $0 $0

SLIDE 52

What Just Happened?

  • That wasn’t planning, it was learning!
    – Specifically, reinforcement learning
    – There was an MDP, but you couldn’t solve it with just computation
    – You needed to actually act to figure it out