ARTIFICIAL INTELLIGENCE
Markov decision processes
Lecturer: Silja Renooij
INFOB2KI 2019-2020, Utrecht University, The Netherlands
These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Prediction vs. planning under full vs. partial observability:
– Fully observable planning: Markov decision process (MDP)
– Partially observable planning: partially observable Markov decision process (POMDP)

[Figure: a fully observable state chain S1 → S2 → S3 → …, versus a partially observable chain in which each state Si only emits an observation Oi.]
[Figure: gridworld. The agent's intended move (here: North) succeeds with p = 0.8; with p = 0.1 each it deviates to East or West. The same deviation model applies to the other actions.]
[Photo: Andrey Markov (1856–1922)]
Example: optimal policy when R(s, a, s′) = −0.03 for all non-terminal states s
[Figure: optimal policies for living rewards R(s) = −2.0, R(s) = −0.4, R(s) = −0.03 and R(s) = −0.01.]
Visualisation: each cell represents the state in which the robot occupies that cell; an arrow indicates the optimal action in the given state. [Four grid panels, each with a +1 and a −1 terminal state.]
– Each time we descend a level, we multiply in the discount once
– Sooner rewards probably do have higher utility than later rewards
– Discounting also helps our algorithms converge
– Example: with discount δ = 0.5, the value of receiving rewards [1, 2, 3] is 1 + 0.5·2 + 0.25·3 = 2.75
– This is less than the value of [3, 2, 1], which is 3 + 0.5·2 + 0.25·1 = 4.25
Utility of a state sequence with discount δ:

$U([s_0, s_1, \ldots, s_n]) = \delta^0 \cdot R(s_0) + \delta^1 \cdot R(s_1) + \cdots + \delta^n \cdot R(s_n)$
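As a quick sanity check of this formula (a sketch, not code from the course; the reward sequences and the discount δ = 0.5 are illustrative numbers only):

```python
def discounted_utility(rewards, delta):
    """Return sum_i delta^i * R(s_i) for a given sequence of rewards R(s_i)."""
    return sum(delta ** i * r for i, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3 + 0.5*2 + 0.25*1 = 4.25
```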
(For later: the intermediate q‐state has value Q(s,a).)
[Figure: from state s, taking action a leads to the q-state (s, a); a transition (s, a, s′) then leads to the successor state s′.]
The optimal value of a state is defined recursively, bottoming out at the terminal states:

$V^*(s) = \max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s'} T(s,a,s') \big[ R(s,a,s') + \delta \cdot V^*(s') \big]$
Substituting one into the other gives the Bellman optimality equation:

$V^*(s) = \max_a \sum_{s'} T(s,a,s') \big[ R(s,a,s') + \delta \cdot V^*(s') \big]$
(i.e. a move succeeds with p = 0.8 and deviates to the left or right each with p = 0.1)
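A minimal Python sketch of this noise model (my own illustration, not code from the course; the 4×3 grid size and the wall position are assumptions based on the figures):

```python
WALLS = {(1, 1)}                 # assumed wall position (row, col)
ROWS, COLS = 3, 4                # assumed 4x3 gridworld

# perpendicular "slip" directions for each intended move
LEFT_OF  = {"N": "W", "S": "E", "E": "N", "W": "S"}
RIGHT_OF = {"N": "E", "S": "W", "E": "S", "W": "N"}
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def move(s, d):
    """Deterministic move in direction d; stay put if blocked."""
    r, c = s
    dr, dc = STEP[d]
    nxt = (r + dr, c + dc)
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return s
    return nxt

def transitions(s, a):
    """T(s, a, .): list of (s', probability) pairs under the noise model."""
    return [(move(s, a), 0.8),
            (move(s, LEFT_OF[a]), 0.1),
            (move(s, RIGHT_OF[a]), 0.1)]

# e.g. trying to go North from the bottom-left corner:
print(transitions((2, 0), "N"))  # [((1, 0), 0.8), ((2, 0), 0.1), ((2, 1), 0.1)]
```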
– Basic idea: approximations get refined towards optimal values
– Start from $V_0(s) = 0$ for all states and repeatedly apply the Bellman update:

$V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s') \big[ R(s,a,s') + \delta \cdot V_k(s') \big]$
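A compact value-iteration sketch implementing exactly this update (again my own illustration; the tiny "racing" MDP with states cool/warm/overheated is hypothetical, and δ = 0.9 matches the discount used in the demos below):

```python
DELTA = 0.9
ACTIONS = ("slow", "fast")

# MODEL[(s, a)] = list of (T(s,a,s'), s', R(s,a,s')) triples -- hypothetical MDP
MODEL = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "overheated", -10.0)],
}
STATES = ("cool", "warm", "overheated")

def value_iteration(model, states, delta, k_max=100):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0 for all s
    for _ in range(k_max):
        # Bellman update: V_{k+1}(s) = max_a sum_{s'} T [ R + delta * V_k(s') ]
        V = {s: max((sum(p * (r + delta * V[s2]) for p, s2, r in model[(s, a)])
                     for a in ACTIONS if (s, a) in model),
                    default=0.0)                     # no actions: terminal state
             for s in states}
    return V

print(value_iteration(MODEL, STATES, DELTA))
```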
[Demo: value iteration on the gridworld, one iteration per slide (values after k = 0, 1, 2, … updates; intermediate steps only shown to demonstrate changes). Parameters: Noise = 0.2, Discount = 0.9, Living reward = 0.]
Value iteration thus repeats the update

$V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s') \big[ R(s,a,s') + \delta \cdot V_k(s') \big]$

until the values converge.
– You determine all quantities through computation
– You need to know the details of the MDP
– You do not actually play the game!
No discount, 100 time steps: both states have the same value.