


Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Yi Sun, Faustino Gomez, Jürgen Schmidhuber

IDSIA, USI & SUPSI, Switzerland

August 2011

Sun,Gomez,Schmidhuber (IDSIA) Bayesian Exploration 08/11 1 / 18


Motivation

• An intelligent agent is sent to explore an unknown environment
• Learning through sequential interactions
• Limited time / resources
• Question: how should the agent choose its actions so that it learns the environment as effectively as possible?
• Example: learning the transition model of a Markovian environment using only 100 ⟨s, a, s′⟩ triples

Preliminary

• A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
  • S = {1, . . . , S} is the state space
  • P is an S × S transition matrix with {P}_{i,j} = Pr[s_{t+1} = j | s_t = i]
  • r ∈ R^S is the reward function
  • γ ∈ [0, 1) is the discount factor
• The value function, v ∈ R^S, is the solution of the Bellman equation v = r + γPv.
• Let L = I − γP; then v = L^{-1}r.
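The closed form v = L^{-1}r can be checked numerically. A minimal sketch, using a made-up 3-state MRP (the transition matrix and rewards are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical 3-state MRP: rows of P sum to 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

L = np.eye(3) - gamma * P      # L = I - gamma * P
v = np.linalg.solve(L, r)      # v = L^{-1} r

# v solves the Bellman equation v = r + gamma * P v
assert np.allclose(v, r + gamma * P @ v)
```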

Preliminary

• Linear function approximation (LFA): v̂ = Φθ, where
  • Φ = [φ_1, . . . , φ_N] are N (N ≪ S) basis functions
  • θ = [θ_1, . . . , θ_N]⊺ are the weights
• The Bellman error ε ∈ R^S is defined as ε = r + γPv̂ − v̂ = r − LΦθ.
• ε ≡ 0 ⟺ v ≡ Φθ
• ε is the expectation of the TD error
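The last bullet (ε is the expectation of the TD error) can be illustrated by sampling; the small MRP and the value estimate below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
S, gamma = 5, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up 5-state MRP
r = rng.standard_normal(S)
v_hat = rng.standard_normal(S)          # arbitrary value estimate

# Bellman error: eps = r + gamma * P v_hat - v_hat
eps = r + gamma * P @ v_hat - v_hat

# TD errors for sampled transitions s -> s'
s = 0
s_next = rng.choice(S, size=200_000, p=P[s])
deltas = r[s] + gamma * v_hat[s_next] - v_hat[s]
# the sample mean of the TD errors approaches eps[s]
```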

Preliminary

• The LFA v̂ = Φθ depends on both θ and Φ.
• To find θ:
  • TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.
• To construct Φ:
  • Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
  • Proto-value basis functions (Mahadevan et al., 2006)
  • Reduced-rank predictive state representations (Boots and Gordon, 2010)
  • L1-regularized feature selection (Kolter and Ng, 2009)

Bellman Error Basis Functions

• Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)
• Construction:
  • φ^{(1)} = r
  • At stage k > 1:
    • Compute the TD fixpoint θ^{(k)} w.r.t. the k current basis functions Φ^{(k)}
    • Get the Bellman error ε^{(k)} = r − LΦ^{(k)}θ^{(k)}
    • Expand: Φ^{(k+1)} = [Φ^{(k)} ⋮ ε^{(k)}]
• A sequence of BEBFs forms an orthogonal basis (Parr et al., 2007)
• Given sufficiently many of them, any value function can be represented
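The construction above can be sketched in a model-based form, assuming exact P and r are available and using the closed-form TD/LSTD fixpoint (Φ⊺LΦ)θ = Φ⊺r; the MRP is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 20, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up transition matrix
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P
v = np.linalg.solve(L, r)               # exact value function

Phi = r.reshape(-1, 1)                  # stage 1: phi^(1) = r
errs = []
for k in range(5):
    # TD fixpoint w.r.t. the current basis: (Phi^T L Phi) theta = Phi^T r
    theta = np.linalg.solve(Phi.T @ L @ Phi, Phi.T @ r)
    errs.append(np.sum((v - Phi @ theta) ** 2))   # squared value error
    eps = r - L @ Phi @ theta                     # Bellman error
    Phi = np.hstack([Phi, eps.reshape(-1, 1)])    # expand: [Phi | eps]
```

Each pass appends the current Bellman error as a new basis function; the squared value error shrinks stage by stage.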

Bellman Error Basis Functions

• Problem with BEBFs: slow convergence when γ → 1.
• Reason: they fail to take the transition structure into account.

Theorem

Let Ĵ^{(k)} and Ĵ^{(k+1)} be the squared value errors corresponding to the BEBF bases Φ^{(k)} and Φ^{(k+1)}. Then ρ^{(k)} = Ĵ^{(k+1)} / Ĵ^{(k)} ≤ γ^2.

A Simple Example

• P = [0 1; 1 0], a two-state deterministic cycle.
• r ∈ R^2 moves along the unit square.
• Start from the empty basis set; the first BEBF is the reward.
• In the figure, the distance between the curve and the origin denotes (ρ^{(1)})^{1/2}.
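For this two-state cycle the quantities can be computed directly. The sketch below fixes r at one corner of the unit square (r = [1, 0], an illustrative choice), where the ratio ρ^{(1)} attains the γ^2 bound of the theorem exactly:

```python
import numpy as np

gamma = 0.95
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # two-state deterministic cycle
r = np.array([1.0, 0.0])              # one corner of the unit square
L = np.eye(2) - gamma * P
v = np.linalg.solve(L, r)

J0 = v @ v                            # squared value error, empty basis
theta = (r @ r) / (r @ (L @ r))       # TD fixpoint for the basis Phi = [r]
J1 = np.sum((v - r * theta) ** 2)     # error after the first BEBF
rho = J1 / J0
assert np.isclose(rho, gamma ** 2)    # the gamma^2 bound is tight here
```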

V-BEBF: Main Idea

• Fix v̂ = Φθ as the current value function estimate; then
• Adding φ = v − v̂ with weight 1 eliminates the error completely.
• A simple derivation gives φ = v − Φθ = L^{-1}r − L^{-1}LΦθ = L^{-1}(r − LΦθ) = L^{-1}ε.
• Observe: φ is the solution of the Bellman equation φ = ε + γPφ.
• So φ is the value function of the Bellman error (V-BEBF).
• φ can be estimated by any RL algorithm, with the TD error as the reward.
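The identity above holds for any current estimate, which a short sketch can verify; the MRP and the starting basis are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 10, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up MRP
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P
v = np.linalg.solve(L, r)               # true value function

Phi = rng.standard_normal((S, 3))       # arbitrary current basis
theta = np.linalg.solve(Phi.T @ L @ Phi, Phi.T @ r)   # TD fixpoint
v_hat = Phi @ theta
eps = r - L @ v_hat                     # Bellman error
phi = np.linalg.solve(L, eps)           # V-BEBF: phi = L^{-1} eps

# adding phi with weight 1 eliminates the value error completely
assert np.allclose(v_hat + phi, v)
```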

V-BEBF: Comparison to BEBF

• Both are reward-sensitive, using the Bellman error.
• When computed exactly, representing a value function may require a long sequence of BEBFs, but a single V-BEBF is enough.
• When approximated, the sequence of V-BEBFs converges much faster than BEBFs as γ → 1.

V-BEBF: Framework

• V-BEBF suggests a natural way to organize RL learners in a hierarchy.
• A primary learner builds the estimate upon a set of basis functions, and propagates the TD error to a secondary learner.
• The secondary learner estimates the value function of the TD error, which then becomes a new basis function used by the primary learner.
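The two-learner loop can be sketched as a single online update. Everything here (the function name, step sizes, and the use of plain TD(0) for both learners) is an assumption for illustration, not the talk's exact algorithm:

```python
import numpy as np

def hierarchy_step(theta, w, Phi, Psi, s, s_next, reward,
                   gamma=0.9, alpha=0.1):
    # primary learner: TD(0) over the refined basis Phi (rows = states)
    delta = reward + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta = theta + alpha * delta * Phi[s]
    # secondary learner: TD(0) over the raw basis Psi, with the primary's
    # TD error delta as its reward; Psi @ w then estimates a V-BEBF that
    # can be appended to the primary's basis
    delta2 = delta + gamma * Psi[s_next] @ w - Psi[s] @ w
    w = w + alpha * delta2 * Psi[s]
    return theta, w
```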

Incremental Basis Projection

• We are given a set of M raw basis functions Ψ = [ψ_1, . . . , ψ_M].
• From Ψ we construct N refined basis functions through a linear mapping: Φ = [φ_1, . . . , φ_N] = Ψ[w_1, . . . , w_N].
• IBP: construct one w_k at stage k.

Incremental Basis Projection

• If the value function is a linear combination of the refined basis functions, it is also a linear combination of the raw basis functions. So why bother?
• A small number of basis functions ⇒ fast convergence.
• A small number of basis functions ⇒ high estimation accuracy.
• Only the learner of the refined basis functions works on the raw basis functions, so it affects the estimate only indirectly.

IBP with V-BEBF

• Approximate each column w_k so that Ψw_k approximates the V-BEBF.
• Sparsity constraints on w_k make the computation tractable: each refined basis function depends only on a handful of raw basis functions.
• In this work we simply choose B ≪ M entries of w_k at random.
• Combine with LSTD to obtain a batch version (O(M^{3/2}) time, O(M) storage), or with TD to obtain an online version (O(MB)).
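One IBP stage under these choices might look like the following sketch; the problem sizes and the least-squares fit of Ψ_B w_B to the V-BEBF (minimizing ||ε − LΨ_B w|| with the model known) are assumptions for illustration, not the paper's batch algorithm verbatim:

```python
import numpy as np

rng = np.random.default_rng(2)
S, M, B, gamma = 50, 25, 5, 0.9
P = rng.dirichlet(np.ones(S), size=S)            # made-up MRP
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P

Psi = (rng.random((S, M)) < 0.3).astype(float)   # binary raw basis functions

# One stage: with an empty basis, the Bellman error is r itself.
eps = r.copy()
idx = rng.choice(M, size=B, replace=False)       # B << M random entries of w
Psi_B = Psi[:, idx]
# fit Psi_B @ w_B to the V-BEBF L^{-1} eps, i.e. minimize ||eps - L Psi_B w||
w_B, *_ = np.linalg.lstsq(L @ Psi_B, eps, rcond=None)
phi = Psi_B @ w_B                                # new refined basis function
```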

Experiments

• Randomly generated MRP: 500 states, branching factor 5.
• Randomly generated binary raw basis functions (30% non-zero).
• Error measured as mean-square value error w.r.t. the LSTD solution.
• In the batch case, B = N = √M, and the training trajectory length is 5000.

Batch

[results figure]

Online

[results figure]

Conclusion

• A simple method for incrementally building up basis functions: just use the value function of the Bellman error.
• Rather effective compared to BEBF when γ → 1.
• Extensions:
  • Deeper hierarchy
  • Multiple secondary learners
  • Incorporating memory into the secondary learner