SLIDE 1 Mastering the game of Go with deep neural networks and tree search
David Silver et al., Google DeepMind
Article overview by Ilya Kuzovkin
Reinforcement Learning Seminar, University of Tartu, 2016
SLIDE 2
THE GAME OF GO
SLIDE 3
BOARD
SLIDE 4
BOARD STONES
SLIDE 5
BOARD STONES GROUPS
SLIDE 6
BOARD STONES LIBERTIES GROUPS
SLIDE 7
BOARD STONES LIBERTIES CAPTURE GROUPS
SLIDE 8
BOARD STONES LIBERTIES CAPTURE KO GROUPS
SLIDE 9
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 10
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 11
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 12 BOARD STONES LIBERTIES CAPTURE TWO EYES KO GROUPS EXAMPLES
SLIDE 13 BOARD STONES LIBERTIES CAPTURE TWO EYES FINAL COUNT KO GROUPS EXAMPLES
SLIDE 14
TRAINING
SLIDE 15 TRAINING: THE BUILDING BLOCKS
- Supervised policy network pσ(a|s) (supervised, classification)
- Reinforcement policy network pρ(a|s) (reinforcement)
- Rollout policy network pπ(a|s) (supervised, classification)
- Tree policy network pτ(a|s) (supervised, classification)
- Value network vθ(s) (supervised, regression)
SLIDE 16
Supervised policy network pσ(a|s)
SLIDE 17 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
SLIDE 18 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
SLIDE 19 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
- Stochastic gradient ascent
- Learning rate α = 0.003, halved every 80M steps
- Batch size m = 16
- 3 weeks on 50 GPUs to make 340M steps
SLIDE 20 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
- Stochastic gradient ascent
- Learning rate α = 0.003, halved every 80M steps
- Batch size m = 16
- 3 weeks on 50 GPUs to make 340M steps
- Augmented: 8 reflections/rotations
- Test set (1M positions) accuracy: 57.0%
- 3 ms to select an action
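To make the architecture concrete, here is a minimal Keras sketch of the network these slides describe; layer sizes follow the slide, while the "same" padding and the exact loss setup are assumptions.

```python
# Minimal Keras sketch of the SL policy network described above.
from tensorflow import keras
from tensorflow.keras import layers

def build_sl_policy_network():
    inputs = keras.Input(shape=(19, 19, 48))        # 48 binary feature planes
    x = layers.Conv2D(192, 5, padding="same", activation="relu")(inputs)
    for _ in range(11):                             # 11 further 3x3 conv layers
        x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1, activation="relu")(x)   # one plane of move scores
    x = layers.Flatten()(x)                         # 361 board points
    outputs = layers.Softmax()(x)                   # p_sigma(a|s)
    return keras.Model(inputs, outputs)

model = build_sl_policy_network()
# Minimizing cross-entropy on expert moves is the gradient ascent on
# log-likelihood the slide mentions (learning rate 0.003).
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.003),
              loss="categorical_crossentropy")
```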
SLIDE 21
19 X 19 X 48 INPUT
SLIDE 22
19 X 19 X 48 INPUT
SLIDE 23
19 X 19 X 48 INPUT
SLIDE 24
19 X 19 X 48 INPUT
SLIDE 25 Rollout policy pπ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 26 Rollout policy pπ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 27 Rollout policy pπ(a|s) Tree policy pτ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 28 Rollout policy pπ(a|s) Tree policy pτ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
- Tree policy pτ(a|s): “similar to the rollout policy but with more features”
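Because the rollout policy is just a linear model with a softmax, it fits in a few lines of numpy; the handcrafted rollout features are not listed on the slide, so the feature vector here is an abstract placeholder.

```python
import numpy as np

# Sketch of a linear-softmax rollout policy p_pi(a|s).
class RolloutPolicy:
    def __init__(self, n_features, n_actions=361):
        self.W = np.zeros((n_actions, n_features))   # one weight row per move

    def probs(self, features, legal_mask):
        logits = self.W @ features                   # linear: no hidden layers
        logits[~legal_mask] = -np.inf                # rule out illegal moves
        z = np.exp(logits - logits[legal_mask].max())
        return z / z.sum()                           # softmax over legal moves
```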
SLIDE 29 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
SLIDE 30 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
SLIDE 31 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
SLIDE 32 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
SLIDE 33 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
SLIDE 34 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
- Batch size n = 128 games
- 10,000 batches
- One day on 50 GPUs
SLIDE 35 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
- Batch size n = 128 games
- 10,000 batches
- One day on 50 GPUs
- 80% wins against the Supervised Network
- 85% wins against Pachi (no search yet!)
- 3 ms to select an action
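The update these slides describe is REINFORCE with a baseline. A sketch, assuming a hypothetical `policy` object that exposes its parameters and score function:

```python
# Sketch of the REINFORCE update from the slides: replay a finished
# self-play game with outcome z = +/-1 and shift the policy toward the
# moves played (if z - baseline > 0) or away from them (if < 0).
# `policy.grad_log_prob` and `policy.params` are hypothetical.
def reinforce_update(policy, states, actions, z, baselines, lr=1e-3):
    for s_t, a_t, v_t in zip(states, actions, baselines):
        advantage = z - v_t     # v_t = 0 on the first pass, v_theta(s_t) on the second
        grad = policy.grad_log_prob(s_t, a_t)   # d log p_rho(a_t|s_t) / d rho
        policy.params += lr * advantage * grad  # gradient ascent on expected reward
```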
SLIDE 36
Value network vθ(s)
SLIDE 37 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
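A matching Keras sketch of the value network: it mirrors the policy network but takes a 49th input plane (the colour to play, per the paper) and ends in fully connected layers with a single tanh output. Padding is again an assumption.

```python
# Minimal Keras sketch of the value network described above.
from tensorflow import keras
from tensorflow.keras import layers

def build_value_network():
    inputs = keras.Input(shape=(19, 19, 49))        # 48 planes + colour to play
    x = layers.Conv2D(192, 5, padding="same", activation="relu")(inputs)
    for _ in range(11):
        x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1, activation="tanh")(x)  # v_theta(s) in [-1, 1]
    return keras.Model(inputs, outputs)

# SGD on mean squared error, as the training slides that follow describe.
build_value_network().compile(optimizer="sgd", loss="mse")
```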
SLIDE 38 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
SLIDE 39 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
SLIDE 40 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play
SLIDE 41 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
SLIDE 42 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
- One week on 50 GPUs to train on 50M batches of size m = 32
SLIDE 43 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
- One week on 50 GPUs to train on 50M batches of size m = 32
- MSE on the test set: 0.234
- Close to the MC estimate from the RL policy, but 15,000 times faster
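The data-generation recipe on the slide can be sketched as follows; `env`, `sl_policy`, and `rl_policy` are hypothetical stand-ins for the actual components, and handling of games that end before move u is omitted.

```python
import random

# Sketch of the slide's recipe: one (s_u, z_u) pair per self-play game,
# which keeps the training positions decorrelated.
def generate_value_training_pair(env, sl_policy, rl_policy, max_moves=450):
    state = env.reset()
    u = random.randint(1, max_moves)                 # random time step u
    for t in range(1, u):                            # moves 1..u-1 from SL policy
        state = env.step(sl_policy.sample(state))
    state = env.step(random.choice(env.legal_moves(state)))  # random move u
    s_u = state
    while not env.is_terminal(state):                # moves u+1..T from RL policy
        state = env.step(rl_policy.sample(state))
    z_u = env.outcome(state)                         # +/-1 game outcome
    return s_u, z_u
```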
SLIDE 44
SLIDE 45
PLAYING
SLIDE 46
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 47 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
SLIDE 48 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found.
SLIDE 49 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found. Position sL is added to the evaluation queue.
SLIDE 50 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found. Position sL is added to the evaluation queue.
A batch of leaf nodes is now selected for evaluation…
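The statistics above map naturally onto a small data structure. The selection rule sketched here is the variant from the AlphaGo paper (Q plus a prior-weighted exploration bonus); the constant c_puct is illustrative.

```python
import math
from dataclasses import dataclass

# The per-edge statistics listed on the slide.
@dataclass
class Edge:
    prior: float            # P(s, a), from the SL policy
    n_evals: int = 0        # number of value-network evaluations
    n_rollouts: int = 0     # number of rollouts
    w_value: float = 0.0    # accumulated value-network estimates
    w_rollout: float = 0.0  # accumulated rollout outcomes
    q: float = 0.0          # combined mean action value Q(s, a)

# Descend by maximizing Q plus an exploration bonus that decays as the
# edge is visited; c_puct = 5.0 is an assumed constant.
def select_action(edges, c_puct=5.0):
    total = sum(e.n_rollouts for e in edges.values())
    def score(a):
        e = edges[a]
        return e.q + c_puct * e.prior * math.sqrt(total) / (1 + e.n_rollouts)
    return max(edges, key=score)
```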
SLIDE 51
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 52
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s).
SLIDE 53
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ(a|s) until the end of each simulated game to get the final game score.
SLIDE 54
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ(a|s) until the end of each simulated game to get the final game score.
Each leaf has been evaluated; we are ready to propagate the updates.
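In the paper the two evaluations are blended with a mixing constant λ (0.5 in their experiments); a one-line sketch:

```python
# Combined leaf evaluation V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L.
def evaluate_leaf(v_theta, rollout_outcome, lam=0.5):
    return (1 - lam) * v_theta + lam * rollout_outcome
```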
SLIDE 55
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 56
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L.
SLIDE 57
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well.
SLIDE 58
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated.
SLIDE 59
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated.
The current tree is now updated.
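A sketch of that backward pass, reusing the Edge structure sketched earlier: each edge on the simulation path accumulates both evaluation sources and recomputes its combined mean Q (mixing constant λ as above).

```python
# Backward pass over all edges visited at steps t <= L.
def backup(path, v_theta, z, lam=0.5):
    for edge in path:
        edge.n_evals += 1
        edge.w_value += v_theta        # value-network contribution
        edge.n_rollouts += 1
        edge.w_rollout += z            # rollout-outcome contribution
        edge.q = ((1 - lam) * edge.w_value / edge.n_evals
                  + lam * edge.w_rollout / edge.n_rollouts)
```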
SLIDE 60
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 61
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
SLIDE 62 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
The new node is initialized using the tree policy, pτ(a|s′), and later updated with the SL policy.
SLIDE 63 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
The new node is initialized using the tree policy, pτ(a|s′), and later updated with the SL policy.
The tree is expanded, fully updated, and ready for the next move!
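A sketch of the expansion step, again reusing the Edge structure from above; the slide leaves nthr unspecified, so the value here is illustrative, and `tree_policy.probs` is a hypothetical interface returning pτ(a|s′).

```python
# Expand s' into the tree once the edge leading to it is visited often
# enough; priors come from the fast tree policy and can later be
# refreshed with the SL policy output when it arrives from the queue.
def maybe_expand(tree, edge, s_prime, tree_policy, n_thr=40):
    if edge.n_rollouts > n_thr and s_prime not in tree:
        tree[s_prime] = {a: Edge(prior=p)
                         for a, p in tree_policy.probs(s_prime).items()}
```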
SLIDE 64
SLIDE 65 https://www.youtube.com/watch?v=oRvlyEpOQ-8
WINNING
SLIDE 66 https://www.youtube.com/watch?v=oRvlyEpOQ-8