SLIDE 1

Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning

Bachelor presentation

Marcel Neidinger

<m.neidinger@unibas.ch>

Department of Mathematics and Computer Science, University of Basel

  • 13 February 2017
SLIDE 2

What is Probabilistic Planning?

Solve planning tasks with probabilistic transitions. Models a Markov Decision Process (MDP) given by M = ⟨V, s0, A, T, R⟩:

  • A set of binary variables V inducing states S = 2^V
  • An initial state s0 ∈ S
  • A set of applicable actions A
  • A transition model T : S × A × S → [0, 1]
  • A reward function R(s, a)

Monte Carlo Tree Search algorithms solve MDPs
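As a toy illustration, the tuple M = ⟨V, s0, A, T, R⟩ might be represented like this (all names and the two-state example are mine, not PROST's internal representation):

```python
from dataclasses import dataclass

# Sketch of the MDP tuple M = <V, s0, A, T, R>; illustrative only.
@dataclass
class MDP:
    variables: list   # binary variables V; a state is the set of true variables
    s0: frozenset     # initial state s0 in S = 2^V
    actions: list     # applicable actions A
    T: dict           # transition model: (state, action) -> {successor: probability}
    R: callable       # reward function R(s, a)

# Two-state toy task: variable "x" flips with probability 0.7 under "toggle".
s_off, s_on = frozenset(), frozenset({"x"})
mdp = MDP(
    variables=["x"],
    s0=s_off,
    actions=["toggle"],
    T={(s_off, "toggle"): {s_on: 0.7, s_off: 0.3},
       (s_on, "toggle"): {s_off: 0.7, s_on: 0.3}},
    R=lambda s, a: 1.0 if "x" in s else 0.0,
)
```

The transition model maps each (state, action) pair to a distribution over successor states, so each inner dictionary sums to 1.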

Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning 2 / 33

SLIDE 3

Monte Carlo Tree Search Algorithms

Algorithmic framework to solve MDPs, used especially in computer Go.

[Images: Go board¹, Lee Sedol²]

¹Source: https://commons.wikimedia.org/wiki/File:Go_board.jpg
²Source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/

SLIDE 4

Four phases - Two components

  • Selection
  • Expansion
  • Simulation
  • Backpropagation

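One trial of the four phases can be sketched as follows (schematic Python, illustrative names; a greedy stand-in is used for the tree policy here, UCT proper appears later):

```python
# Schematic MCTS trial showing the four phases; a toy sketch, not PROST.
class Node:
    def __init__(self, state):
        self.state, self.children, self.N, self.value = state, [], 0, 0.0

def run_trial(root, expand, simulate):
    path = [root]
    node = root
    # Selection: descend through already-expanded nodes
    # (greedy stand-in for a real tree policy such as UCT).
    while node.children:
        node = max(node.children, key=lambda c: c.value / (c.N + 1))
        path.append(node)
    # Expansion: add the successors of the selected leaf.
    node.children = [Node(s) for s in expand(node.state)]
    # Simulation: estimate the leaf's value with the default policy.
    reward = simulate(node.state)
    # Backpropagation: update statistics along the selected path.
    for n in path:
        n.N += 1
        n.value += reward
    return reward
```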
SLIDE 5

Monte Carlo Tree node

MCTS tree for an MDP M. Important information in a tree node:

  • A state s ∈ S
  • A counter N^(i) for the number of visits
  • A counter N^(i)(s, a) ∀a ∈ A for the number of times a was selected in s
  • A reward estimate Q^(i)(s, a) for action a in state s

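The bookkeeping above might look like this in code (names are illustrative; Q is maintained as a running average of observed rewards):

```python
from collections import defaultdict

# Sketch of the per-node statistics listed above; illustrative only.
class MCTSNode:
    def __init__(self, state):
        self.state = state                      # state s in S
        self.visits = 0                         # N^(i): visits to this node
        self.action_visits = defaultdict(int)   # N^(i)(s, a) per action a
        self.q = defaultdict(float)             # Q^(i)(s, a): reward estimate

    def update(self, action, reward):
        # Incremental mean keeps q[a] the average reward of choosing a in s.
        self.visits += 1
        self.action_visits[action] += 1
        n = self.action_visits[action]
        self.q[action] += (reward - self.q[action]) / n
```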
SLIDE 6

Online Knowledge

AlphaGo used neural networks for the two policies → domain-specific knowledge. We want domain-independent enhancements.

SLIDE 7

Overview

  • Tree-Policy Enhancements
    • All Moves as First
      • α-AMAF
      • Cutoff-AMAF
    • Rapid Action Value Estimation
  • Default-Policy Enhancements
    • Move-Average Sampling Technique
  • Conclusion

SLIDE 8

What is a Tree Policy?

Iterate through the known part of the tree and select an action given a node. Use a Q value for a state-action pair to estimate an action's reward.

SLIDE 9

UCT

MCTS implementation first proposed in 2006

[Figure: example trial through states s1, …, s5 with moves m, m′, m′′; reward 10]

SLIDE 10

UCT

Reward approximation, parent node vl, child node vj:

UCT(vl, vj) = Q(i)(sl, aj) + 2·Cp·√( 2·ln N(i)(sl) / N(i+1)(sj) )   (1)

From parent vl select the child node v∗ that maximises

v∗ = arg max over vj of { UCT(vl, vj) }   (2)

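Equations (1) and (2) translate directly into code (a minimal sketch; the encoding of child nodes as (Q, visit-count) pairs is my own):

```python
import math

# Eq. (1): Q^(i)(s_l, a_j) + 2*C_p*sqrt(2*ln N^(i)(s_l) / N^(i+1)(s_j)).
def uct(q, cp, parent_visits, child_visits):
    return q + 2 * cp * math.sqrt(2 * math.log(parent_visits) / child_visits)

# Eq. (2): from the parent, pick the child with maximal UCT score.
def select_child(children, cp, parent_visits):
    # children: list of (q, visits) pairs; returns the index of the maximiser.
    scores = [uct(q, cp, parent_visits, n) for q, n in children]
    return scores.index(max(scores))
```

With Cp = 0 the rule is purely greedy on Q; a positive Cp adds an exploration bonus that favours rarely visited children.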
SLIDE 11

All Moves as First - Idea

The UCT score needs several trials to become reliable. Idea: generalize the information extracted from trials. Implementation: use an additional (node-independent) score that updates unselected actions as well.

[Figure: example trial through states s1, …, s5 with moves m, m′, m′′; reward 10]

[Table: state–action–reward tuples recorded from the trial, e.g. (s1, m, …)]

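One common way to implement the AMAF table sketched above (my formulation: every action occurring from a state onward in the trial updates that state's AMAF statistics, whether or not it was selected there):

```python
from collections import defaultdict

# AMAF statistics, shared across the tree (node-independent).
amaf_q = defaultdict(float)   # AMAF reward estimate per (state, action)
amaf_n = defaultdict(int)     # AMAF visit count per (state, action)

def amaf_update(trial, reward):
    # trial: list of (state, action) pairs from root to leaf.
    for i, (state, _) in enumerate(trial):
        # All moves from this state onward count "as first" here.
        for _, action in trial[i:]:
            key = (state, action)
            amaf_n[key] += 1
            amaf_q[key] += (reward - amaf_q[key]) / amaf_n[key]
```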
SLIDE 12

All Moves as First - α-AMAF

Idea: combine the UCT and AMAF scores:

SCR = α·AMAF + (1 − α)·UCT   (3)

Choose the action with the highest SCR.

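Equation (3) as code (trivial, but it makes the fixed blend explicit; assumes α ∈ [0, 1]):

```python
# Eq. (3): fixed linear blend of the AMAF and UCT scores.
def alpha_amaf(alpha, amaf_score, uct_score):
    return alpha * amaf_score + (1 - alpha) * uct_score
```

α = 0 recovers plain UCT, α = 1 uses the AMAF score alone.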
SLIDE 13

All Moves as First - α-AMAF - Results

[Plot: IPPC score per domain (wildfire, triangle, academic, elevators, tamarisk, sysadmin, recon, game, traffic, crossing, skill, navigation, total) for AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]

SLIDE 14

All Moves as First - α-AMAF - Problems

With more trials, the UCT score becomes more reliable, while the AMAF score has higher variance.

We want to discontinue using the AMAF score after some time.

SLIDE 16

All Moves as First - Cutoff-AMAF

Introduce a cutoff parameter K:

SCR = { α·AMAF + (1 − α)·UCT   for i ≤ K
      { UCT                    otherwise   (4)

Use the AMAF score only in the first K trials.

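A minimal sketch of Eq. (4), assuming i is the index of the current trial:

```python
# Eq. (4): blend AMAF into the score only during the first K trials,
# pure UCT afterwards.
def cutoff_amaf(alpha, amaf_score, uct_score, i, K):
    if i <= K:
        return alpha * amaf_score + (1 - alpha) * uct_score
    return uct_score
```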
SLIDE 17

All Moves as First - Cutoff-AMAF - Results

[Plot: total IPPC score vs. cutoff value K ∈ {10, 20, 30, 40, 50} for Cutoff-AMAF (init: IDS, backup: MC), compared with raw UCT and plain α-AMAF]

SLIDE 18

All Moves as First - Cutoff-AMAF - Problems

How to choose the parameter K? When is the UCT score reliable enough?

SLIDE 19

Rapid Action Value Estimation - Idea

First introduced in 2007 for computer Go. Use a soft cutoff:

α = max { 0, (V − v(n)) / V }   (5)

Use the UCT score for often-visited nodes and the AMAF score for less-visited ones.

SLIDE 20

Rapid Action Value Estimation - Results

[Plot: IPPC score per domain for UCT and RAVE with V ∈ {5, 15, 25, 50}]

SLIDE 21

All Moves as First - Conclusion

[Plot: IPPC score per domain for UCT, RAVE(25), and AMAF(α = 0.2)]

SLIDE 22

Rapid Action Value Estimation - Problems

PROST uses a problem description with conditional effects and without preconditions, so the PROST description is more general.

[Figure: grid world with player, goal field, and move path]

In PROST the action is e.g. move_up; in computer chess, by contrast, it would be e.g. move_a2_to_a3.

SLIDE 23

Predicate Rapid Action Value Estimation

A state has predicates that give some context. Idea: use predicates to find similar states and use their score:

Q_PRAVE(s, a) = (1/N) · Σ_{p ∈ P} Q_RAVE(p, a)   (6)

and weight with

α = max { 0, (V − v(n)) / V }   (7)

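Equation (6) as a sketch (keying the RAVE table by (predicate, action) is my own encoding; N is the number of predicates P that hold in s):

```python
# Eq. (6): average the RAVE estimates of the predicates p in P holding in s.
def prave_q(predicates, action, q_rave):
    # q_rave: dict mapping (predicate, action) -> RAVE estimate.
    return sum(q_rave[(p, action)] for p in predicates) / len(predicates)
```

The resulting Q_PRAVE is then blended with UCT using the weight from Eq. (7), exactly as in RAVE.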
SLIDE 24

All Moves as First - Conclusion - Revisited

[Plot: IPPC score per domain for UCT, PRAVE, RAVE(25), and AMAF(α = 0.2)]

SLIDE 25

Overview

  • Tree-Policy Enhancements
    • All Moves as First
      • α-AMAF
      • Cutoff-AMAF
    • Rapid Action Value Estimation
  • Default-Policy Enhancements
    • Move-Average Sampling Technique
  • Conclusion

SLIDE 26

What is a Default Policy?


Simulate the outcome of a trial. Basic default policy: random walk.

SLIDE 27

X-Average Sampling Technique

Use tree knowledge to bias default policy towards moves that are more goal-oriented

SLIDE 28

Move-Average Sampling Technique - Idea - Sample Game

[Figure: grid world with player, goal field, and move path]

Introduce Q(a). Use moves that are good on average. Choose an action according to:

P(a) = e^(Q(a)/τ) / Σ_{b ∈ A} e^(Q(b)/τ)   (8)

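Equation (8) is a Gibbs/Boltzmann ("softmax") distribution over the average action values; a sketch (τ is the temperature, overflow guards omitted):

```python
import math
import random

# Eq. (8): P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau).
def mast_probabilities(q, tau):
    weights = {a: math.exp(q[a] / tau) for a in q}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def mast_sample(q, tau, rng=random):
    # Sample an action according to its MAST probability.
    probs = mast_probabilities(q, tau)
    actions, p = zip(*probs.items())
    return rng.choices(actions, weights=p)[0]
```

A small τ makes the policy nearly greedy on Q(a); a large τ approaches the uniform random walk.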
SLIDE 29

Move-Average Sampling Technique - Idea - Example

Actions: r, r, u, u, u → Q(r) = 1, N(r) = 2; Q(u) = 6, N(u) = 3
Actions: r, r, u, l, l → Q(r) = 2, N(r) = 4; Q(u) = 7, N(u) = 4; Q(l) = 3, N(l) = 2

SLIDE 30

Move-Average Sampling Technique - Idea - Example (2)

Actions: l, u, u, r, r → Q(r) = 7, N(r) = 6; Q(u) = 8, N(u) = 6; Q(l) = 2, N(l) = 3
Actions: r, r, r, u, u → Q(r) = 7, N(r) = 9; Q(u) = 9, N(u) = 8; Q(l) = 2, N(l) = 3

SLIDE 31

Move-Average Sampling Technique - Results

[Plot: IPPC score per domain for UCT(RandomWalk) and UCT(MAST)]

SLIDE 32

Overview

  • Tree-Policy Enhancements
    • All Moves as First
      • α-AMAF
      • Cutoff-AMAF
    • Rapid Action Value Estimation
  • Default-Policy Enhancements
    • Move-Average Sampling Technique
  • Conclusion

SLIDE 33

Conclusion

Tree-policy enhancements

α-AMAF and RAVE perform worse than standard UCT; PRAVE performs slightly better, but still worse than standard UCT.

Default-policy enhancements

MAST outperforms RandomWalk

SLIDE 34

Questions?

m.neidinger@unibas.ch