SLIDE 1 Mastering the game of Go with deep neural networks and tree search
David Silver et al., Google DeepMind
Article overview by Ilya Kuzovkin
Reinforcement Learning Seminar, University of Tartu, 2016
SLIDE 2
THE GAME OF GO
SLIDE 3
BOARD
SLIDE 4
BOARD STONES
SLIDE 5
BOARD STONES GROUPS
SLIDE 6
BOARD STONES LIBERTIES GROUPS
SLIDE 7
BOARD STONES LIBERTIES CAPTURE GROUPS
SLIDE 8
BOARD STONES LIBERTIES CAPTURE KO GROUPS
SLIDE 9
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 10
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 11
BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES
SLIDE 12 BOARD STONES LIBERTIES CAPTURE TWO EYES KO GROUPS EXAMPLES
SLIDE 13 BOARD STONES LIBERTIES CAPTURE TWO EYES FINAL COUNT KO GROUPS EXAMPLES
SLIDE 14
TRAINING
SLIDE 15 TRAINING: THE BUILDING BLOCKS
- Supervised policy network pσ(a|s) (supervised, classification)
- Reinforcement policy network pρ(a|s) (reinforcement)
- Rollout policy network pπ(a|s) (supervised, classification)
- Tree policy network pτ(a|s) (supervised, classification)
- Value network vθ(s) (supervised, regression)
SLIDE 16
Supervised policy network pσ(a|s)
SLIDE 17 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
SLIDE 18 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
SLIDE 19 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
- Stochastic gradient ascent
- Learning rate α = 0.003, halved every 80M steps
- Batch size m = 16
- 3 weeks on 50 GPUs to make 340M steps
SLIDE 20 Supervised policy network pσ(a|s)
- 19 x 19 x 48 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Softmax
- 29.4M positions from games between 6 and 9 dan players
- Stochastic gradient ascent
- Learning rate α = 0.003, halved every 80M steps
- Batch size m = 16
- 3 weeks on 50 GPUs to make 340M steps
- Augmented: 8 reflections/rotations
- Test set (1M positions) accuracy: 57.0%
- 3 ms to select an action
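To make the architecture concrete, here is a minimal Keras sketch of the network these slides describe; layer sizes follow the slide, while the "same" padding and the exact loss setup are assumptions.

```python
# Minimal Keras sketch of the SL policy network described above.
from tensorflow import keras
from tensorflow.keras import layers

def build_sl_policy_network():
    inputs = keras.Input(shape=(19, 19, 48))        # 48 binary feature planes
    x = layers.Conv2D(192, 5, padding="same", activation="relu")(inputs)
    for _ in range(11):                             # 11 further 3x3 conv layers
        x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1, activation="relu")(x)   # one plane of move scores
    x = layers.Flatten()(x)                         # 361 board points
    outputs = layers.Softmax()(x)                   # p_sigma(a|s)
    return keras.Model(inputs, outputs)

model = build_sl_policy_network()
# Minimizing cross-entropy on expert moves is the gradient ascent on
# log-likelihood the slide mentions (learning rate 0.003).
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.003),
              loss="categorical_crossentropy")
```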
SLIDE 21
19 X 19 X 48 INPUT
SLIDE 22
19 X 19 X 48 INPUT
SLIDE 23
19 X 19 X 48 INPUT
SLIDE 24
19 X 19 X 48 INPUT
SLIDE 25 Rollout policy pπ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 26 Rollout policy pπ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 27 Rollout policy pπ(a|s) Tree policy pτ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
SLIDE 28 Rollout policy pπ(a|s) Tree policy pτ(a|s)
- Supervised: same data as pσ(a|s)
- Less accurate: 24.2% (vs. 57.0%)
- Faster: 2 μs per action (1500 times faster)
- Just a linear model with softmax
- Tree policy pτ(a|s): “similar to the rollout policy but with more features”
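Because the rollout policy is just a linear model with a softmax, it fits in a few lines of numpy; the handcrafted rollout features are not listed on the slide, so the feature vector here is an abstract placeholder.

```python
import numpy as np

# Sketch of a linear-softmax rollout policy p_pi(a|s).
class RolloutPolicy:
    def __init__(self, n_features, n_actions=361):
        self.W = np.zeros((n_actions, n_features))   # one weight row per move

    def probs(self, features, legal_mask):
        logits = self.W @ features                   # linear: no hidden layers
        logits[~legal_mask] = -np.inf                # rule out illegal moves
        z = np.exp(logits - logits[legal_mask].max())
        return z / z.sum()                           # softmax over legal moves
```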
SLIDE 29 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
SLIDE 30 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
SLIDE 31 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
SLIDE 32 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
SLIDE 33 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
SLIDE 34 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
- Batch size n = 128 games
- 10,000 batches
- One day on 50 GPUs
SLIDE 35 Reinforcement policy network pρ(a|s)
- Same architecture as pσ; weights are initialized with ρ = σ
- Self-play: current network vs. randomized pool of previous versions
- Play a game until the end, get the reward zt = ±r(sT) = ±1
- Set z_t^i = zt and play the same game again, this time updating the network parameters at each time step t
- Baseline: v(s_t^i) = 0 “on the first pass through the training pipeline”, v(s_t^i) = vθ(s_t^i) “on the second pass”
- Batch size n = 128 games
- 10,000 batches
- One day on 50 GPUs
- 80% wins against the Supervised Network
- 85% wins against Pachi (no search yet!)
- 3 ms to select an action
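The update these slides describe is REINFORCE with a baseline. A sketch, assuming a hypothetical `policy` object that exposes its parameters and score function:

```python
# Sketch of the REINFORCE update from the slides: replay a finished
# self-play game with outcome z = +/-1 and shift the policy toward the
# moves played (if z - baseline > 0) or away from them (if < 0).
# `policy.grad_log_prob` and `policy.params` are hypothetical.
def reinforce_update(policy, states, actions, z, baselines, lr=1e-3):
    for s_t, a_t, v_t in zip(states, actions, baselines):
        advantage = z - v_t     # v_t = 0 on the first pass, v_theta(s_t) on the second
        grad = policy.grad_log_prob(s_t, a_t)   # d log p_rho(a_t|s_t) / d rho
        policy.params += lr * advantage * grad  # gradient ascent on expected reward
```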
SLIDE 36
Value network vθ(s)
SLIDE 37 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
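A matching Keras sketch of the value network: it mirrors the policy network but takes a 49th input plane (the colour to play, per the paper) and ends in fully connected layers with a single tanh output. Padding is again an assumption.

```python
# Minimal Keras sketch of the value network described above.
from tensorflow import keras
from tensorflow.keras import layers

def build_value_network():
    inputs = keras.Input(shape=(19, 19, 49))        # 48 planes + colour to play
    x = layers.Conv2D(192, 5, padding="same", activation="relu")(inputs)
    for _ in range(11):
        x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1, activation="tanh")(x)  # v_theta(s) in [-1, 1]
    return keras.Model(inputs, outputs)

# SGD on mean squared error, as the training slides that follow describe.
build_value_network().compile(optimizer="sgd", loss="mse")
```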
SLIDE 38 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
SLIDE 39 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
SLIDE 40 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play
SLIDE 41 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
SLIDE 42 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
- One week on 50 GPUs to train on 50M batches of size m = 32
SLIDE 43 Value network vθ(s)
- 19 x 19 x 49 input
- 1 convolutional layer 5x5 with k=192 filters, ReLU
- 11 convolutional layers 3x3 with k=192 filters, ReLU
- 1 convolutional layer 1x1, ReLU
- Fully connected layer with 256 ReLU units
- Fully connected layer with 1 tanh unit
- Evaluate the value of position s under policy p: v^p(s) = E[zt | st = s, at…T ~ p]
- Double approximation: vθ(s) ≈ v^pρ(s) ≈ v*(s)
- Stochastic gradient descent to minimize the MSE between vθ(s) and the outcome z
- Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  - choose a random time step u
  - sample moves t = 1…u-1 from the SL policy
  - make a random move u
  - sample t = u+1…T from the RL policy and get the game outcome
  - add (su, zu) to the training set
- One week on 50 GPUs to train on 50M batches of size m = 32
- MSE on the test set: 0.234
- Close to the MC estimate from the RL policy, but 15,000 times faster
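The data-generation recipe on the slide can be sketched as follows; `env`, `sl_policy`, and `rl_policy` are hypothetical stand-ins for the actual components, and handling of games that end before move u is omitted.

```python
import random

# Sketch of the slide's recipe: one (s_u, z_u) pair per self-play game,
# which keeps the training positions decorrelated.
def generate_value_training_pair(env, sl_policy, rl_policy, max_moves=450):
    state = env.reset()
    u = random.randint(1, max_moves)                 # random time step u
    for t in range(1, u):                            # moves 1..u-1 from SL policy
        state = env.step(sl_policy.sample(state))
    state = env.step(random.choice(env.legal_moves(state)))  # random move u
    s_u = state
    while not env.is_terminal(state):                # moves u+1..T from RL policy
        state = env.step(rl_policy.sample(state))
    z_u = env.outcome(state)                         # +/-1 game outcome
    return s_u, z_u
```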
SLIDE 44
SLIDE 45
PLAYING
SLIDE 46
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 47 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
SLIDE 48 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found.
SLIDE 49 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found. Position sL is added to the evaluation queue.
SLIDE 50 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Each node s has edges (s, a) for all legal actions; each edge stores statistics:
- Prior
- Number of evaluations
- Number of rollouts
- MC value estimate
- Rollout value estimate
- Combined mean action value
Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found. Position sL is added to the evaluation queue.
A batch of leaf nodes is now selected for evaluation…
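The statistics above map naturally onto a small data structure. The selection rule sketched here is the variant from the AlphaGo paper (Q plus a prior-weighted exploration bonus); the constant c_puct is illustrative.

```python
import math
from dataclasses import dataclass

# The per-edge statistics listed on the slide.
@dataclass
class Edge:
    prior: float            # P(s, a), from the SL policy
    n_evals: int = 0        # number of value-network evaluations
    n_rollouts: int = 0     # number of rollouts
    w_value: float = 0.0    # accumulated value-network estimates
    w_rollout: float = 0.0  # accumulated rollout outcomes
    q: float = 0.0          # combined mean action value Q(s, a)

# Descend by maximizing Q plus an exploration bonus that decays as the
# edge is visited; c_puct = 5.0 is an assumed constant.
def select_action(edges, c_puct=5.0):
    total = sum(e.n_rollouts for e in edges.values())
    def score(a):
        e = edges[a]
        return e.q + c_puct * e.prior * math.sqrt(total) / (1 + e.n_rollouts)
    return max(edges, key=score)
```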
SLIDE 51
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 52
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s).
SLIDE 53
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ(a|s) until the end of each simulated game to get the final game score.
SLIDE 54
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ(a|s) until the end of each simulated game to get the final game score.
Each leaf has been evaluated; we are ready to propagate the updates.
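In the paper the two evaluations are blended with a mixing constant λ (0.5 in their experiments); a one-line sketch:

```python
# Combined leaf evaluation V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L.
def evaluate_leaf(v_theta, rollout_outcome, lam=0.5):
    return (1 - lam) * v_theta + lam * rollout_outcome
```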
SLIDE 55
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 56
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L.
SLIDE 57
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well.
SLIDE 58
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated.
SLIDE 59
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Statistics along the path of each simulation are updated during the backward pass through each step t ≤ L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated.
The current tree is now updated.
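A sketch of that backward pass, reusing the Edge structure sketched earlier: each edge on the simulation path accumulates both evaluation sources and recomputes its combined mean Q (mixing constant λ as above).

```python
# Backward pass over all edges visited at steps t <= L.
def backup(path, v_theta, z, lam=0.5):
    for edge in path:
        edge.n_evals += 1
        edge.w_value += v_theta        # value-network contribution
        edge.n_rollouts += 1
        edge.w_rollout += z            # rollout-outcome contribution
        edge.q = ((1 - lam) * edge.w_value / edge.n_evals
                  + lam * edge.w_rollout / edge.n_rollouts)
```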
SLIDE 60
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
SLIDE 61
APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
SLIDE 62 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
The new node is initialized using the tree policy, pτ(a|s′), and later updated with the SL policy.
SLIDE 63 APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS
Once an edge (s, a) has been visited enough times (nthr), its successor state s′ is added to the tree.
The new node is initialized using the tree policy, pτ(a|s′), and later updated with the SL policy.
The tree is expanded, fully updated, and ready for the next move!
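A sketch of the expansion step, again reusing the Edge structure from above; the slide leaves nthr unspecified, so the value here is illustrative, and `tree_policy.probs` is a hypothetical interface returning pτ(a|s′).

```python
# Expand s' into the tree once the edge leading to it is visited often
# enough; priors come from the fast tree policy and can later be
# refreshed with the SL policy output when it arrives from the queue.
def maybe_expand(tree, edge, s_prime, tree_policy, n_thr=40):
    if edge.n_rollouts > n_thr and s_prime not in tree:
        tree[s_prime] = {a: Edge(prior=p)
                         for a, p in tree_policy.probs(s_prime).items()}
```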
SLIDE 64
SLIDE 65 https://www.youtube.com/watch?v=oRvlyEpOQ-8
WINNING
SLIDE 66 https://www.youtube.com/watch?v=oRvlyEpOQ-8