SLIDE 1

An Emphatic Approach to the Problem of Off-policy TD Learning

Rich Sutton, Rupam Mahmood, Martha White

Reinforcement Learning and Artificial Intelligence Laboratory
Department of Computing Science
University of Alberta, Canada

SLIDE 2

Temporal-Difference Learning with Linear Function Approximation

States $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$.

Return: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, with $0 \le \gamma < 1$.

Target policy: $\pi(a|s) \doteq P\{A_t = a \mid S_t = s\}$.

Value function: $v_\pi(s) \doteq E_\pi[G_t \mid S_t = s] \approx w_t^\top x(s)$, with feature vectors $x(s) \in \mathbb{R}^n$, $\forall s \in \mathcal{S}$, and a weight vector $w_t \in \mathbb{R}^n$, $n \ll |\mathcal{S}|$.

Linear TD(0):
$$w_{t+1} \doteq w_t + \alpha \big( R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t) \big)\, x(S_t),$$
which can be rewritten as
$$w_{t+1} = w_t + \alpha \big( \underbrace{R_{t+1} x(S_t)}_{b_t \in \mathbb{R}^n} - \underbrace{x(S_t)\big(x(S_t) - \gamma x(S_{t+1})\big)^\top}_{A_t \in \mathbb{R}^{n \times n}} w_t \big) = (I - \alpha A_t)\, w_t + \alpha b_t.$$

Deterministic 'expected' update:
$$\bar{w}_{t+1} = (I - \alpha A)\, \bar{w}_t + \alpha b,$$
where
$$A = \lim_{t\to\infty} E[A_t] = \lim_{t\to\infty} E_\pi\!\Big[ x(S_t)\big(x(S_t) - \gamma x(S_{t+1})\big)^\top \Big] = \sum_s d_\pi(s)\, x(s) \Big( x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s') \Big)^{\!\top} = X^\top D_\pi (I - \gamma P_\pi) X.$$

Here $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\, p(j|i,a)$, with $p(j|i,a) \doteq P\{S_{t+1} = j \mid S_t = i, A_t = a\}$, is the transition probability matrix under $\pi$; $X$ is the matrix whose rows are the feature vectors $x(1)^\top, x(2)^\top, \ldots, x(|\mathcal{S}|)^\top$; and $D_\pi$ is the diagonal matrix with the ergodic stationary distribution $d_\pi$ on its diagonal, $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty} P\{S_t = s\} > 0$, whose key property is that $P_\pi^\top d_\pi = d_\pi$.

Stable if $A$ is positive definite, i.e., if $y^\top A y > 0$, $\forall y \ne 0$; the expected update then converges to $\lim_{t\to\infty} \bar{w}_t = A^{-1} b$.

If this "key matrix" $D_\pi(I - \gamma P_\pi)$ is pos. def., then $A$ is pos. def. and everything is stable. I showed in 1988 that the key matrix is pos. def. if its column sums are $> 0$.
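To make the update concrete, here is a minimal sketch of linear TD(0) in Python/numpy. It is not from the slides; the environment and feature interface (env.reset, env.step, features, pi) are illustrative assumptions.

```python
import numpy as np

def linear_td0(env, features, pi, n, alpha=0.01, gamma=0.9, num_steps=10_000):
    """Minimal on-policy linear TD(0) sketch.

    Assumed (illustrative) interface:
      env.reset() -> state
      env.step(state, action) -> (reward, next_state)
      features(state) -> np.ndarray of shape (n,)
      pi(state) -> an action sampled from the target policy
    """
    w = np.zeros(n)
    s = env.reset()
    for _ in range(num_steps):
        a = pi(s)
        r, s_next = env.step(s, a)
        x, x_next = features(s), features(s_next)
        # TD error: R_{t+1} + gamma * w^T x(S_{t+1}) - w^T x(S_t)
        delta = r + gamma * (w @ x_next) - w @ x
        w += alpha * delta * x  # w_{t+1} = w_t + alpha * delta_t * x(S_t)
        s = s_next
    return w
```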

SLIDE 3

(Continuing Slide 2.) Why the on-policy key matrix $D_\pi(I - \gamma P_\pi)$ has positive column sums: for the $j$th column, the sum is
$$\sum_i [D_\pi(I - \gamma P_\pi)]_{ij} = \sum_i \sum_k [D_\pi]_{ik} [I - \gamma P_\pi]_{kj} = \sum_i [D_\pi]_{ii} [I - \gamma P_\pi]_{ij} = \sum_i d_\pi(i) [I - \gamma P_\pi]_{ij} = [d_\pi^\top (I - \gamma P_\pi)]_j = [d_\pi^\top - \gamma d_\pi^\top P_\pi]_j = [d_\pi^\top - \gamma d_\pi^\top]_j = (1 - \gamma)\, d_\pi(j) > 0,$$
where the second-to-last step uses the stationarity property $d_\pi^\top P_\pi = d_\pi^\top$.
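A quick numerical check of the column-sum identity above. The three-state chain here is an arbitrary illustration (not from the slides); the check confirms that each column of $D_\pi(I - \gamma P_\pi)$ sums to $(1-\gamma)\,d_\pi(j)$ and that the symmetrized key matrix is positive definite.

```python
import numpy as np

# Arbitrary irreducible 3-state chain under pi (illustrative only)
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
gamma = 0.9

# Stationary distribution d_pi: left eigenvector of P_pi for eigenvalue 1
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi /= d_pi.sum()

key = np.diag(d_pi) @ (np.eye(3) - gamma * P_pi)

print(key.sum(axis=0))          # column sums
print((1 - gamma) * d_pi)       # equal to (1 - gamma) * d_pi(j), as derived
print(np.all(np.linalg.eigvalsh(key + key.T) > 0))  # symmetrized key matrix is pos. def.
```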

SLIDE 4

2 off-policy learning problems

1. Correcting for the distribution of future returns
   Solution: importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).

2. Correcting for the state-update distribution
   Solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

SLIDE 5

Off-policy Temporal-Difference Learning with Linear Function Approximation

States $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, and rewards $R_{t+1} \in \mathbb{R}$ as before, but the target policy $\pi(a|s)$ is no longer used to select actions; a behavior policy $\mu(a|s)$ is used to select actions.

Assume coverage: $\pi(a|s) > 0 \Rightarrow \mu(a|s) > 0$, $\forall s, a$.

New ergodic stationary distribution: $[d_\mu]_s \doteq d_\mu(s) \doteq \lim_{t\to\infty} P\{S_t = s\} > 0$, $\forall s \in \mathcal{S}$.

Old value function (still what we want to learn): $v_\pi(s) \doteq E_\pi[G_t \mid S_t = s] \approx w_t^\top x(s)$.

Importance sampling ratio: $\rho_t \doteq \dfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$, for which
$$E_\mu[\rho_t \mid S_t = s] = \sum_a \mu(a|s) \frac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1,$$
and, for any random variable $Z_{t+1}$,
$$E_\mu[\rho_t Z_{t+1} \mid S_t = s] = \sum_a \mu(a|s) \frac{\pi(a|s)}{\mu(a|s)} Z_{t+1} = \sum_a \pi(a|s) Z_{t+1} = E_\pi[Z_{t+1} \mid S_t = s].$$

Linear off-policy TD(0):
$$w_{t+1} \doteq w_t + \rho_t\, \alpha \big( R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t \big) x_t = w_t + \alpha \big( \underbrace{\rho_t R_{t+1} x_t}_{b_t} - \underbrace{\rho_t x_t (x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t \big), \qquad x_t \doteq x(S_t),$$
and its $A$ matrix:
$$A = \lim_{t\to\infty} E[A_t] = \lim_{t\to\infty} E_\mu\!\Big[ \rho_t x_t (x_t - \gamma x_{t+1})^\top \Big] = \sum_s d_\mu(s)\, E_\mu\!\Big[ \rho_t x_t (x_t - \gamma x_{t+1})^\top \,\Big|\, S_t = s \Big] = \sum_s d_\mu(s)\, E_\pi\!\Big[ x_t (x_t - \gamma x_{t+1})^\top \,\Big|\, S_t = s \Big] = \sum_s d_\mu(s)\, x(s) \Big( x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s') \Big)^{\!\top} = X^\top D_\mu (I - \gamma P_\pi) X.$$

The key matrix now has mismatched $D$ and $P$ matrices ($D_\mu$ but $P_\pi$); stability is no longer assured.
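A minimal sketch of linear off-policy TD(0) as defined above. The policy and environment interfaces (mu.sample, mu.prob, pi.prob, env.step, features) are illustrative assumptions, not from the slides.

```python
import numpy as np

def off_policy_linear_td0(env, features, pi, mu, n,
                          alpha=0.01, gamma=0.9, num_steps=10_000):
    """Off-policy linear TD(0) with per-step importance sampling (sketch)."""
    w = np.zeros(n)
    s = env.reset()
    for _ in range(num_steps):
        a = mu.sample(s)                       # actions are chosen by the behavior policy
        rho = pi.prob(a, s) / mu.prob(a, s)    # importance sampling ratio rho_t
        r, s_next = env.step(s, a)
        x, x_next = features(s), features(s_next)
        delta = r + gamma * (w @ x_next) - w @ x
        w += alpha * rho * delta * x           # may diverge: the key matrix need not be pos. def.
        s = s_next
    return w
```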

SLIDE 6

Counterexample: the $w \to 2w$ problem. Two states whose approximate values are $w$ and $2w$, i.e., $X = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$, with $\gamma = 0.9$, $\lambda = 0$, $\mu(\text{right}\,|\,\cdot) = 0.5$, $\pi(\text{right}\,|\,\cdot) = 1$.

Transition probability matrix under the target policy (which always goes right, into the second state), and the behavior policy's stationary distribution:
$$P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad d_\mu = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}.$$

Off-policy TD(0)'s key matrix:
$$D_\mu (I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} \begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix}.$$
The second column sums to $-0.4 < 0$!

Positive-definiteness test:
$$X^\top D_\mu (I - \gamma P_\pi) X = \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{pmatrix} -0.4 \\ 0.1 \end{pmatrix} = -0.2.$$

$A$ is not positive definite! Stability is not assured.
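The counterexample arithmetic is easy to reproduce numerically; the matrices below are exactly those on the slide.

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])      # single feature per state: x(1) = 1, x(2) = 2
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])     # target policy always takes 'right'
d_mu = np.array([0.5, 0.5])       # behavior policy's stationary distribution

key = np.diag(d_mu) @ (np.eye(2) - gamma * P_pi)
print(key)                        # [[0.5, -0.45], [0.0, 0.05]]
print(key.sum(axis=0))            # [0.5, -0.4]: second column sums to less than 0
print((X.T @ key @ X).item())     # -0.2: A = X^T D_mu (I - gamma P_pi) X is not pos. def.
```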
SLIDE 7

2 off-policy learning problems (recap)

1. Correcting for the distribution of future returns: solved by importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).

2. Correcting for the state-update distribution: no known solution other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

SLIDE 8

Geometric Insight

[Figure (after Ben Van Roy 2009): geometric picture of value-function approximation, with labels $J^*$, $\tilde{J}_r$, and $\hat{v}$.]

SLIDE 9

Other Distribution

[Figure (after Ben Van Roy 2009): the same geometric picture, with labels $J^*$, $\tilde{J}_r$, and $\hat{v}$, under another weighting distribution.]

SLIDE 10

Problem 2 of off-policy learning: Correcting for the state-update distribution

  • The distribution of updated states does not ‘match’ the target policy
  • Only a problem with function approximation, but that’s a show stopper
  • Precup, Sutton & Dasgupta (2001) treated the episodic case, used importance sampling to warp the state distribution from the behavior policy’s distribution to the target policy’s distribution, then did a future-reweighted update at each state
    • equivalent to emphasis = product of all i.s. ratios since the beginning of time (see the sketch after this list)
    • ok algorithm, but severe variance problems in both theory and practice
  • Performance assessed on whole episodes following the target policy
  • This ‘alternate life’ view of off-policy learning was then abandoned
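As referenced in the list above, here is a minimal sketch of that 'alternate life' emphasis: the running product of importance sampling ratios since the start of the episode. The exact indexing (whether the update at time t includes ρ_t itself) is an assumption of this illustration, not a statement of the Precup, Sutton & Dasgupta (2001) algorithm.

```python
def alternate_life_emphasis(rhos):
    """Running product of importance sampling ratios (the 'alternate life' weighting).

    rhos: sequence of per-step ratios rho_0, rho_1, ...
    Returns the emphasis applied to the update at each time step; its variance
    grows rapidly with episode length, which is the variance problem noted above.
    """
    m, emphasis = 1.0, []
    for rho in rhos:
        emphasis.append(m)   # weight for the update at time t: rho_0 * ... * rho_{t-1}
        m *= rho             # fold in rho_t for the next step
    return emphasis
```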
SLIDE 11

The excursion view of off-policy learning

  • In which we are following a (possibly changing) behavior policy forever, and are in its stationary distribution
  • We want to predict the consequences of deviating from it for a limited time with various target policies (e.g., options)
  • Error is assessed on these ‘excursions’ starting from states in the behavior distribution
  • Much more practical setting than ‘alternate life’
  • This setting was the basis for all the work with gradient-TD and MSPBE

SLIDE 12

Emphasis warping

  • The idea is that emphasis warps the distribution of updated states from the behavior policy’s stationary distribution to something like the ‘followon distribution’ of the target policy started in the behavior policy’s stationary distribution
  • From which future-reweighted updates will be stable in expectation; this follows from old results (Dayan 1992, Sutton 1988) on convergence of TD(λ) in episodic MDPs
  • A new algorithm: Emphatic TD(λ)
SLIDE 13

Emphatic TD(0)

Introduces a new short-term memory random variable, the followon trace:
$$F_{-1} \doteq 0, \qquad F_t \doteq \gamma \rho_{t-1} F_{t-1} + 1, \quad \forall t \ge 0.$$

Emphatic TD(0):
$$w_{t+1} \doteq w_t + \alpha F_t \rho_t \big( R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t \big) x_t = w_t + \alpha \big( \underbrace{F_t \rho_t R_{t+1} x_t}_{b_t} - \underbrace{F_t \rho_t x_t (x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t \big),$$
with
$$A = \lim_{t\to\infty} E[A_t] = \lim_{t\to\infty} E_\mu\!\Big[ F_t \rho_t x_t (x_t - \gamma x_{t+1})^\top \Big] = X^\top F (I - \gamma P_\pi) X,$$
where $F$ is the diagonal matrix with diagonal $f$, $[f]_s \doteq d_\mu(s) \lim_{t\to\infty} E_\mu[F_t \mid S_t = s]$, so that
$$f = d_\mu + \gamma P_\pi^\top d_\mu + \big( \gamma P_\pi^\top \big)^2 d_\mu + \cdots = \big( I - \gamma P_\pi^\top \big)^{-1} d_\mu.$$

Sum of the $j$th column of the key matrix $F(I - \gamma P_\pi)$:
$$\sum_i [F(I - \gamma P_\pi)]_{ij} = \sum_i \sum_k [F]_{ik} [I - \gamma P_\pi]_{kj} = \sum_i [F]_{ii} [I - \gamma P_\pi]_{ij} = \sum_i [f]_i [I - \gamma P_\pi]_{ij} = [f^\top (I - \gamma P_\pi)]_j = [d_\mu^\top (I - \gamma P_\pi)^{-1} (I - \gamma P_\pi)]_j = [d_\mu^\top]_j = d_\mu(j) > 0.$$
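A minimal sketch of Emphatic TD(0) as defined above: the only change from the off-policy TD(0) loop is the followon trace multiplying the update. The interface is the same illustrative one assumed in the earlier sketches; the trace is initialized to zero before the first step, matching $F_{-1} = 0$ above.

```python
import numpy as np

def emphatic_td0(env, features, pi, mu, n,
                 alpha=0.01, gamma=0.9, num_steps=10_000):
    """Emphatic TD(0) (sketch), with the followon trace F_t."""
    w = np.zeros(n)
    F, rho_prev = 0.0, 0.0        # F_{-1} = 0, so the value of rho_{-1} does not matter
    s = env.reset()
    for _ in range(num_steps):
        F = gamma * rho_prev * F + 1.0        # F_t = gamma * rho_{t-1} * F_{t-1} + 1
        a = mu.sample(s)
        rho = pi.prob(a, s) / mu.prob(a, s)
        r, s_next = env.step(s, a)
        x, x_next = features(s), features(s_next)
        delta = r + gamma * (w @ x_next) - w @ x
        w += alpha * F * rho * delta * x      # the emphasis F_t scales the whole update
        rho_prev = rho
        s = s_next
    return w
```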

SLIDE 14

The counterexample, revisited with Emphatic TD(0). The problem is as before ($\gamma = 0.9$, $\lambda = 0$, $\mu(\text{right}\,|\,\cdot) = 0.5$, $\pi(\text{right}\,|\,\cdot) = 1$, $P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$, the $w \to 2w$ problem). The followon vector is
$$[f]_1 = d_\mu(1) = 0.5, \qquad [f]_2 = 0.5 + 0.9 + 0.9^2 + 0.9^3 + \cdots = 0.5 + 0.9 \cdot 10 = 9.5,$$
so Emphatic TD(0)'s key matrix is
$$F(I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 9.5 \end{pmatrix} \begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.95 \end{pmatrix}.$$
Now every column sums to $> 0$ (each column sums to $0.5$), so the key matrix, and hence $A$, is positive definite, and stability is assured on this problem.
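The emphatic version of the counterexample check, again using exactly the matrices from the slide; it also evaluates the quadratic form, which is not shown on the slide but is positive here.

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])

# Followon vector f = (I - gamma * P_pi^T)^{-1} d_mu
f = np.linalg.solve(np.eye(2) - gamma * P_pi.T, d_mu)
print(f)                          # [0.5, 9.5]

key = np.diag(f) @ (np.eye(2) - gamma * P_pi)
print(key)                        # [[0.5, -0.45], [0.0, 0.95]]
print(key.sum(axis=0))            # [0.5, 0.5]: every column sums to more than 0
print((X.T @ key @ X).item())     # 3.4 > 0: the emphatic A is positive definite here
```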

SLIDE 15

[Figure 3: two panels plotting the weight w against steps for off-policy TD(0) and emphatic TD(0) on w → 2w problems; the panels use γ = 0.9 (with γ = 0 for part of one problem), μ(right|·) = 0.5 or 0.1, and π(right|·) = 1.]

Figure 3: Emphatic TD approaches the correct value of zero, whereas conventional off-policy TD diverges, on fifty trajectories on the w → 2w problems shown above each graph. Also shown as a thick line is the trajectory of the deterministic expected-update algorithm. On the continuing problem (left) emphatic TD has occasional high-variance deviations from zero.
SLIDE 16

[Figure 4 diagram: a 5-state problem with approximate values built from three weights w1, w2, w3 (some states use sums such as w1 + w2 and w2 + w3), rewards of 1, μ(left|·) = 2/3, π(right|·) = 1, per-state discounts γ = 1, 1, 1, 0, 0, and true values vπ = 1, 1, 2, 3, 4.]

Figure 4: Twenty learning curves and their analytic expectation on the 5-state problem from Section 5, in which excursions terminate promptly and both algorithms converge reliably. Here λ = 0, w0 = 0, α = 0.001, and i(s) = 1, ∀s. The MSVE performance measure is defined in (20):
$$\overline{\mathrm{MSVE}}(w) \doteq \sum_{s \in \mathcal{S}} d_\mu(s)\, i(s)\, \big( v_\pi(s) - w^\top x(s) \big)^2.$$
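When the true values are known, as in these small test problems, the MSVE measure in the caption is a one-liner; a minimal sketch:

```python
import numpy as np

def msve(w, X, v_pi, d_mu, interest=None):
    """MSVE(w) = sum_s d_mu(s) * i(s) * (v_pi(s) - w^T x(s))^2.

    X: |S| x n feature matrix, v_pi: vector of true values, d_mu: behavior
    policy's stationary distribution, interest: i(s), defaulting to 1 for
    all states as on the slide.
    """
    if interest is None:
        interest = np.ones(len(v_pi))
    errors = v_pi - X @ w
    return float(np.sum(d_mu * interest * errors ** 2))
```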

SLIDE 17

Summary of emphatic results

  • Linear emphatic TD(0) is the simplest TD algorithm with linear function approximation that is stable under off-policy training
  • Some empirical illustrations
  • Stability theorem for the full case of GVFs
  • Convergence w.p.1 theorem (Janey Yu, under review)
  • Asymptotic approximation bounds (Remi Munos)
  • Also a new (better?) algorithm for the on-policy case