


Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Yi Sun, Faustino Gomez, Jürgen Schmidhuber

IDSIA, USI & SUPSI, Switzerland

August 2011

Sun,Gomez,Schmidhuber (IDSIA) Bayesian Exploration 08/11 1 / 18


Motivation

• An intelligent agent is sent to explore an unknown environment
• Learning through sequential interactions
• Limited time / resources
• Question: how should the agent choose its actions so that it learns the environment as effectively as possible?
• Example: learning the transition model of a Markovian environment using only 100 ⟨s, a, s′⟩ triples

Preliminary

• A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
  • S = {1, . . . , S} is the state space
  • P is an S × S transition matrix with {P}_{i,j} = Pr[s_{t+1} = j | s_t = i]
  • r ∈ R^S is the reward function
  • γ ∈ [0, 1) is the discount factor
• The value function, v ∈ R^S, is the solution of the Bellman equation v = r + γPv.
• Let L = I − γP; then v = L^{-1}r.
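The closed form v = L^{-1}r can be checked numerically. A minimal sketch, using a made-up 3-state MRP (the transition matrix and rewards are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical 3-state MRP: rows of P sum to 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

L = np.eye(3) - gamma * P      # L = I - gamma * P
v = np.linalg.solve(L, r)      # v = L^{-1} r

# v solves the Bellman equation v = r + gamma * P v
assert np.allclose(v, r + gamma * P @ v)
```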

Preliminary

• Linear function approximation (LFA): v̂ = Φθ, where
  • Φ = [φ_1, . . . , φ_N] are N (N ≪ S) basis functions
  • θ = [θ_1, . . . , θ_N]⊺ are the weights
• The Bellman error ε ∈ R^S is defined as ε = r + γPv̂ − v̂ = r − LΦθ.
• ε ≡ 0 ⟺ v ≡ Φθ
• ε is the expectation of the TD error
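The last bullet (ε is the expectation of the TD error) can be illustrated by sampling; the small MRP and the value estimate below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
S, gamma = 5, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up 5-state MRP
r = rng.standard_normal(S)
v_hat = rng.standard_normal(S)          # arbitrary value estimate

# Bellman error: eps = r + gamma * P v_hat - v_hat
eps = r + gamma * P @ v_hat - v_hat

# TD errors for sampled transitions s -> s'
s = 0
s_next = rng.choice(S, size=200_000, p=P[s])
deltas = r[s] + gamma * v_hat[s_next] - v_hat[s]
# the sample mean of the TD errors approaches eps[s]
```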

Preliminary

• The LFA v̂ = Φθ depends on both θ and Φ.
• To find θ:
  • TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.
• To construct Φ:
  • Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
  • Proto-value basis functions (Mahadevan et al., 2006)
  • Reduced-rank predictive state representations (Boots and Gordon, 2010)
  • L1-regularized feature selection (Kolter and Ng, 2009)

Bellman Error Basis Functions

• Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)
• Construction:
  • φ^{(1)} = r
  • At stage k > 1:
    • Compute the TD fixpoint θ^{(k)} w.r.t. the k current basis functions Φ^{(k)}
    • Get the Bellman error ε^{(k)} = r − LΦ^{(k)}θ^{(k)}
    • Expand: Φ^{(k+1)} = [Φ^{(k)} ⋮ ε^{(k)}]
• A sequence of BEBFs forms an orthogonal basis (Parr et al., 2007)
• Given sufficiently many of them, any value function can be represented
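The construction above can be sketched in a model-based form, assuming exact P and r are available and using the closed-form TD/LSTD fixpoint (Φ⊺LΦ)θ = Φ⊺r; the MRP is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 20, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up transition matrix
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P
v = np.linalg.solve(L, r)               # exact value function

Phi = r.reshape(-1, 1)                  # stage 1: phi^(1) = r
errs = []
for k in range(5):
    # TD fixpoint w.r.t. the current basis: (Phi^T L Phi) theta = Phi^T r
    theta = np.linalg.solve(Phi.T @ L @ Phi, Phi.T @ r)
    errs.append(np.sum((v - Phi @ theta) ** 2))   # squared value error
    eps = r - L @ Phi @ theta                     # Bellman error
    Phi = np.hstack([Phi, eps.reshape(-1, 1)])    # expand: [Phi | eps]
```

Each pass appends the current Bellman error as a new basis function; the squared value error shrinks stage by stage.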

Bellman Error Basis Functions

• Problem with BEBFs: slow convergence when γ → 1.
• Reason: they fail to take the transition structure into account.

Theorem

Let Ĵ^{(k)} and Ĵ^{(k+1)} be the squared value errors corresponding to the BEBF bases Φ^{(k)} and Φ^{(k+1)}. Then ρ^{(k)} = Ĵ^{(k+1)} / Ĵ^{(k)} ≤ γ^2.

A Simple Example

• P = [0 1; 1 0], a two-state deterministic cycle.
• r ∈ R^2 moves along the unit square.
• Start from the empty basis set; the first BEBF is the reward.
• In the figure, the distance between the curve and the origin denotes (ρ^{(1)})^{1/2}.
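For this two-state cycle the quantities can be computed directly. The sketch below fixes r at one corner of the unit square (r = [1, 0], an illustrative choice), where the ratio ρ^{(1)} attains the γ^2 bound of the theorem exactly:

```python
import numpy as np

gamma = 0.95
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # two-state deterministic cycle
r = np.array([1.0, 0.0])              # one corner of the unit square
L = np.eye(2) - gamma * P
v = np.linalg.solve(L, r)

J0 = v @ v                            # squared value error, empty basis
theta = (r @ r) / (r @ (L @ r))       # TD fixpoint for the basis Phi = [r]
J1 = np.sum((v - r * theta) ** 2)     # error after the first BEBF
rho = J1 / J0
assert np.isclose(rho, gamma ** 2)    # the gamma^2 bound is tight here
```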

V-BEBF: Main Idea

• Fix v̂ = Φθ as the current value function estimate; then
• Adding φ = v − v̂ with weight 1 eliminates the error completely.
• A simple derivation gives φ = v − Φθ = L^{-1}r − L^{-1}LΦθ = L^{-1}(r − LΦθ) = L^{-1}ε.
• Observe: φ is the solution of the Bellman equation φ = ε + γPφ.
• So φ is the value function of the Bellman error (V-BEBF).
• φ can be estimated by any RL algorithm, with the TD error as the reward.
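The identity above holds for any current estimate, which a short sketch can verify; the MRP and the starting basis are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 10, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # made-up MRP
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P
v = np.linalg.solve(L, r)               # true value function

Phi = rng.standard_normal((S, 3))       # arbitrary current basis
theta = np.linalg.solve(Phi.T @ L @ Phi, Phi.T @ r)   # TD fixpoint
v_hat = Phi @ theta
eps = r - L @ v_hat                     # Bellman error
phi = np.linalg.solve(L, eps)           # V-BEBF: phi = L^{-1} eps

# adding phi with weight 1 eliminates the value error completely
assert np.allclose(v_hat + phi, v)
```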

V-BEBF: Comparison to BEBF

• Both are reward-sensitive, using the Bellman error.
• When computed exactly, representing a value function may require a long sequence of BEBFs, but a single V-BEBF is enough.
• When approximated, the sequence of V-BEBFs converges much faster than BEBFs as γ → 1.

V-BEBF: Framework

• V-BEBF suggests a natural way to organize RL learners in a hierarchy.
• A primary learner builds the estimate upon a set of basis functions, and propagates the TD error to a secondary learner.
• The secondary learner estimates the value function of the TD error, which then becomes a new basis function used by the primary learner.
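The two-learner loop can be sketched as a single online update. Everything here (the function name, step sizes, and the use of plain TD(0) for both learners) is an assumption for illustration, not the talk's exact algorithm:

```python
import numpy as np

def hierarchy_step(theta, w, Phi, Psi, s, s_next, reward,
                   gamma=0.9, alpha=0.1):
    # primary learner: TD(0) over the refined basis Phi (rows = states)
    delta = reward + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta = theta + alpha * delta * Phi[s]
    # secondary learner: TD(0) over the raw basis Psi, with the primary's
    # TD error delta as its reward; Psi @ w then estimates a V-BEBF that
    # can be appended to the primary's basis
    delta2 = delta + gamma * Psi[s_next] @ w - Psi[s] @ w
    w = w + alpha * delta2 * Psi[s]
    return theta, w
```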

Incremental Basis Projection

• We are given a set of M raw basis functions Ψ = [ψ_1, . . . , ψ_M].
• From Ψ we construct N refined basis functions through a linear mapping: Φ = [φ_1, . . . , φ_N] = Ψ[w_1, . . . , w_N].
• IBP: construct one w_k at stage k.

Incremental Basis Projection

• If the value function is a linear combination of the refined basis functions, it is also a linear combination of the raw basis functions. So why bother?
• A small number of basis functions ⇒ fast convergence.
• A small number of basis functions ⇒ high estimation accuracy.
• Only the learner of the refined basis functions works on the raw basis functions, so it affects the estimate only indirectly.

IBP with V-BEBF

• Approximate each column w_k so that Ψw_k approximates the V-BEBF.
• Sparsity constraints on w_k make the computation tractable: each refined basis function depends only on a handful of raw basis functions.
• In this work we simply choose B ≪ M entries of w_k at random.
• Combine with LSTD to obtain a batch version (O(M^{3/2}) time, O(M) storage), or with TD to obtain an online version (O(MB)).
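One IBP stage under these choices might look like the following sketch; the problem sizes and the least-squares fit of Ψ_B w_B to the V-BEBF (minimizing ||ε − LΨ_B w|| with the model known) are assumptions for illustration, not the paper's batch algorithm verbatim:

```python
import numpy as np

rng = np.random.default_rng(2)
S, M, B, gamma = 50, 25, 5, 0.9
P = rng.dirichlet(np.ones(S), size=S)            # made-up MRP
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P

Psi = (rng.random((S, M)) < 0.3).astype(float)   # binary raw basis functions

# One stage: with an empty basis, the Bellman error is r itself.
eps = r.copy()
idx = rng.choice(M, size=B, replace=False)       # B << M random entries of w
Psi_B = Psi[:, idx]
# fit Psi_B @ w_B to the V-BEBF L^{-1} eps, i.e. minimize ||eps - L Psi_B w||
w_B, *_ = np.linalg.lstsq(L @ Psi_B, eps, rcond=None)
phi = Psi_B @ w_B                                # new refined basis function
```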

Experiments

• Randomly generated MRP: 500 states, branching factor 5.
• Randomly generated binary raw basis functions (30% non-zero).
• Error measured as mean-square value error w.r.t. the LSTD solution.
• In the batch case, B = N = √M, and the training trajectory length is 5000.

Batch

[results figure]

Online

[results figure]

Conclusion

• A simple method for incrementally building up basis functions: just use the value function of the Bellman error.
• Rather effective compared to BEBF when γ → 1.
• Extensions:
  • Deeper hierarchy
  • Multiple secondary learners
  • Incorporating memory into the secondary learner