An Emphatic Approach to the Problem of Off-policy TD Learning
Rich Sutton, Rupam Mahmood, Martha White
Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Canada
Temporal-Difference Learning with Linear Function Approximation

- states $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$
- return $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, with $0 \le \gamma < 1$
- policy $\pi(a|s) \doteq \mathbb{P}\{A_t\!=\!a \mid S_t\!=\!s\}$
- value function $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t\!=\!s] \approx w_t^\top x(s)$
- feature vectors $x(s) \in \mathbb{R}^n$, $\forall s \in \mathcal{S}$
- weight vector $w_t \in \mathbb{R}^n$, with $n \ll |\mathcal{S}|$

linear TD(0):
$$w_{t+1} \doteq w_t + \alpha\left(R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t)\right)x(S_t)
= w_t + \alpha\Big(\underbrace{R_{t+1}x(S_t)}_{b_t \in \mathbb{R}^n} - \underbrace{x(S_t)\left(x(S_t) - \gamma x(S_{t+1})\right)^\top}_{A_t \in \mathbb{R}^{n\times n}}\, w_t\Big)
= (I - \alpha A_t)w_t + \alpha b_t.$$

deterministic 'expected' update:
$$\bar{w}_{t+1} \doteq (I - \alpha A)\bar{w}_t + \alpha b.$$

Stable if $A$ is positive definite, i.e., if $y^\top A y > 0,\ \forall y \neq 0$. Converges to $\lim_{t\to\infty}\bar{w}_t = A^{-1}b$.

$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[x(S_t)\left(x(S_t) - \gamma x(S_{t+1})\right)^\top\right]
= \sum_s d_\pi(s)\, x(s)\Big(x(s) - \gamma\sum_{s'}[P_\pi]_{ss'}\,x(s')\Big)^{\!\top}
= X^\top D_\pi(I - \gamma P_\pi)X,$$

where
- $X$ is the $|\mathcal{S}| \times n$ matrix with the feature vectors $x(s)^\top$ as its rows, and $D_\pi \doteq \mathrm{diag}(d_\pi)$;
- $P_\pi$ is the transition probability matrix, $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\,p(j|i,a)$, where $p(j|i,a) \doteq \mathbb{P}\{S_{t+1}\!=\!j \mid S_t\!=\!i, A_t\!=\!a\}$;
- $d_\pi$ is the ergodic stationary distribution, $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty}\mathbb{P}\{S_t\!=\!s\} > 0$, whose defining property is $P_\pi^\top d_\pi = d_\pi$.

If this "key matrix" $D_\pi(I - \gamma P_\pi)$ is positive definite, then $A$ is positive definite and everything is stable. I showed in 1988 that the key matrix is positive definite if its column sums are $> 0$.
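As a concrete illustration of the update above, here is a minimal NumPy sketch of linear TD(0); the environment interface (env.reset, env.step) and the feature map phi are hypothetical stand-ins, not anything defined in the talk.

```python
import numpy as np

def linear_td0(env, phi, n_features, alpha=0.01, gamma=0.9, num_steps=10_000):
    """Minimal sketch of on-policy linear TD(0).

    Assumed (illustrative) interface: env.reset() -> state,
    env.step(state) -> (reward, next_state) with actions chosen internally
    by the policy pi, and phi(state) -> feature vector x(s) of length n_features.
    """
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        r, s_next = env.step(s)
        x, x_next = phi(s), phi(s_next)
        # TD error: R_{t+1} + gamma * w^T x(S_{t+1}) - w^T x(S_t)
        delta = r + gamma * w @ x_next - w @ x
        # Semi-gradient TD(0) update: w <- w + alpha * delta * x(S_t)
        w += alpha * delta * x
        s = s_next
    return w
```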
For the $j$th column of the key matrix, the sum is
$$\sum_i [D_\pi(I-\gamma P_\pi)]_{ij}
= \sum_i\sum_k [D_\pi]_{ik}[I-\gamma P_\pi]_{kj}
= \sum_i [D_\pi]_{ii}[I-\gamma P_\pi]_{ij}
= \sum_i d_\pi(i)[I-\gamma P_\pi]_{ij}$$
$$= \left[d_\pi^\top(I-\gamma P_\pi)\right]_j
= \left[d_\pi^\top - \gamma d_\pi^\top P_\pi\right]_j
= \left[d_\pi^\top - \gamma d_\pi^\top\right]_j
= (1-\gamma)\,d_\pi(j) > 0,$$
using the stationarity property $d_\pi^\top P_\pi = d_\pi^\top$.
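The column-sum result is easy to check numerically. The sketch below builds the key matrix $D_\pi(I-\gamma P_\pi)$ for a made-up 3-state chain (the transition matrix below is illustrative only, not from the talk) and confirms that its column sums equal $(1-\gamma)d_\pi > 0$.

```python
import numpy as np

# A made-up 3-state Markov chain under the target policy (rows sum to 1).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
gamma = 0.9

# Stationary distribution d_pi: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmax(np.real(evals))])
d_pi = d_pi / d_pi.sum()

key = np.diag(d_pi) @ (np.eye(3) - gamma * P_pi)

print(key.sum(axis=0))        # column sums of the key matrix
print((1 - gamma) * d_pi)     # matches: (1 - gamma) * d_pi, all > 0
```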
- solution: importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ)
- solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…
Off-policy Temporal-Difference Learning with Linear Function Approximation

- states $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$
- the target policy $\pi(a|s)$ is no longer used to select actions; a behavior policy $\mu(a|s)$ is used to select actions!
- assume coverage: $\pi(a|s) > 0 \implies \mu(a|s) > 0$, $\forall s, a$
- new ergodic stationary distribution: $[d_\mu]_s \doteq \lim_{t\to\infty}\mathbb{P}\{S_t\!=\!s\} > 0$, $\forall s \in \mathcal{S}$
- value function as before: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t\!=\!s] \approx w_t^\top x(s)$
- importance sampling ratio: $\rho_t \doteq \dfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$

$$\mathbb{E}_\mu[\rho_t \mid S_t\!=\!s] = \sum_a \mu(a|s)\frac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1.$$

For any random variable $Z_{t+1}$:
$$\mathbb{E}_\mu[\rho_t Z_{t+1} \mid S_t\!=\!s] = \sum_a \mu(a|s)\frac{\pi(a|s)}{\mu(a|s)}\,Z_{t+1} = \sum_a \pi(a|s)\,Z_{t+1} = \mathbb{E}_\pi[Z_{t+1} \mid S_t\!=\!s].$$

linear off-policy TD(0) (with $x_t \doteq x(S_t)$):
$$w_{t+1} \doteq w_t + \rho_t\,\alpha\left(R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\right)x_t
= w_t + \alpha\Big(\underbrace{\rho_t R_{t+1} x_t}_{b_t} - \underbrace{\rho_t x_t(x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t\Big),$$

and its $A$ matrix:
$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\mu\!\left[\rho_t x_t(x_t - \gamma x_{t+1})^\top\right]
= \sum_s d_\mu(s)\,\mathbb{E}_\mu\!\left[\rho_t x_t(x_t - \gamma x_{t+1})^\top \,\middle|\, S_t\!=\!s\right]$$
$$= \sum_s d_\mu(s)\,\mathbb{E}_\pi\!\left[x_t(x_t - \gamma x_{t+1})^\top \,\middle|\, S_t\!=\!s\right]
= \sum_s d_\mu(s)\, x(s)\Big(x(s) - \gamma\sum_{s'}[P_\pi]_{ss'}\,x(s')\Big)^{\!\top}
= X^\top D_\mu(I - \gamma P_\pi)X.$$
The key matrix now has mismatched $D$ and $P$ matrices ($D_\mu$ paired with $P_\pi$); it is not guaranteed to be positive definite, and off-policy TD(0) is not stable. In particular, a column of the key matrix can sum to $< 0$, as in the counterexample below.
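For concreteness, here is a minimal sketch of linear off-policy TD(0) with per-step importance sampling ratios, in the same hypothetical NumPy interface as before (env, phi, and the policy-probability functions pi_probs and mu_probs are stand-ins, not defined in the talk).

```python
import numpy as np

def off_policy_linear_td0(env, phi, pi_probs, mu_probs, n_features,
                          alpha=0.01, gamma=0.9, num_steps=10_000, rng=None):
    """Sketch of linear off-policy TD(0) with importance sampling ratios.

    Assumed (illustrative) interface: pi_probs(s) and mu_probs(s) return
    action-probability vectors for the target and behavior policies,
    env.step(s, a) -> (reward, next_state), and phi(s) -> feature vector x(s).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        mu = mu_probs(s)
        a = rng.choice(len(mu), p=mu)            # behavior policy selects the action
        r, s_next = env.step(s, a)
        rho = pi_probs(s)[a] / mu[a]             # rho_t = pi(A_t|S_t) / mu(A_t|S_t)
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * w @ x_next - w @ x   # TD error
        w += alpha * rho * delta * x             # rho-corrected semi-gradient update
        s = s_next
    return w
```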
Counterexample: the $w \to 2w$ problem, with $\gamma = 0.9$, $\lambda = 0$, $\mu(\text{right}|\cdot) = 0.5$, $\pi(\text{right}|\cdot) = 1$.

Two states whose estimated values are $w$ and $2w$, i.e., single features $X = \begin{pmatrix}1\\2\end{pmatrix}$, and transition probability matrix
$$P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$

The key matrix is
$$D_\mu(I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}\begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix},$$
whose second column sums to $-0.4 < 0$!

pos. def. test:
$$X^\top D_\mu(I - \gamma P_\pi)X = \begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} -0.4 \\ 0.1 \end{pmatrix} = -0.2 < 0.$$
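The slide's arithmetic can be verified directly with NumPy, using only the numbers given above:

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])              # features: estimated values w and 2w
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])             # target policy always goes right
D_mu = np.diag([0.5, 0.5])                # behavior policy visits both states equally

key = D_mu @ (np.eye(2) - gamma * P_pi)
print(key)                  # [[0.5, -0.45], [0., 0.05]]
print(key.sum(axis=0))      # second column sums to -0.4 < 0
print(X.T @ key @ X)        # [[-0.2]]  -> not positive definite
```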
[Figure (after Van Roy, 2009): the true value function $v_\pi$ (denoted $J^*$) versus the approximate solution $\hat v$ (denoted $\tilde J_r$).]
Ideas leading to emphasis:
- …importance sampling to warp the state distribution from the behavior policy's distribution to the target policy's distribution, then did a future-reweighted update at each state
- …beginning of time
- …policy forever, and are in its stationary distribution
- …for a limited time with various target policies (e.g., options)
- …in the behavior distribution
- …and MSPBE
- …updated states from the behavior policy's stationary distribution to something like the 'followon distribution' of the target policy started in the behavior policy's stationary distribution
- …in expectation; this follows from old results (Dayan 1992, Sutton 1988) on convergence of TD(λ) in episodic MDPs
Emphatic TD(0)

Introduces a new short-term memory random variable, the followon trace:
$$F_0 \doteq 0, \qquad F_t \doteq \gamma\rho_{t-1}F_{t-1} + 1, \quad \forall t > 0.$$

Emphatic TD(0):
$$w_{t+1} \doteq w_t + \alpha F_t\rho_t\left(R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\right)x_t
= w_t + \alpha\Big(\underbrace{F_t\rho_t R_{t+1} x_t}_{b_t} - \underbrace{F_t\rho_t x_t(x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t\Big),$$

with
$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\mu\!\left[F_t\rho_t x_t(x_t - \gamma x_{t+1})^\top\right] = X^\top F(I - \gamma P_\pi)X,$$
where the key matrix is now $F(I - \gamma P_\pi)$, with $F \doteq \mathrm{diag}(f)$ and $[f]_s \doteq d_\mu(s)\lim_{t\to\infty}\mathbb{E}_\mu[F_t \mid S_t\!=\!s]$, for which we have:
$$f = d_\mu + \gamma P_\pi^\top d_\mu + \left(\gamma P_\pi^\top\right)^2 d_\mu + \cdots = \left(I - \gamma P_\pi^\top\right)^{-1} d_\mu.$$
Sum of the $j$th column of the key matrix:
$$\sum_i [F(I-\gamma P_\pi)]_{ij}
= \sum_i\sum_k [F]_{ik}[I-\gamma P_\pi]_{kj}
= \sum_i [F]_{ii}[I-\gamma P_\pi]_{ij}
= \sum_i [f]_i[I-\gamma P_\pi]_{ij}$$
$$= \left[f^\top(I-\gamma P_\pi)\right]_j
= \left[d_\mu^\top(I-\gamma P_\pi)^{-1}(I-\gamma P_\pi)\right]_j
= \left[d_\mu^\top\right]_j
= d_\mu(j) > 0.$$

Every column of the key matrix sums to $> 0$.
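Here is a minimal sketch of emphatic TD(0) with the followon trace, again using the hypothetical env/phi/pi_probs/mu_probs interface from the earlier sketches:

```python
import numpy as np

def emphatic_td0(env, phi, pi_probs, mu_probs, n_features,
                 alpha=0.01, gamma=0.9, num_steps=10_000, rng=None):
    """Sketch of emphatic TD(0): off-policy TD(0) whose updates are scaled
    by the followon trace F_t = gamma * rho_{t-1} * F_{t-1} + 1."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    F, rho_prev = 0.0, 1.0                    # followon trace (F_0 = 0 per the slide)
    s = env.reset()
    for _ in range(num_steps):
        mu = mu_probs(s)
        a = rng.choice(len(mu), p=mu)         # behavior policy selects the action
        r, s_next = env.step(s, a)
        rho = pi_probs(s)[a] / mu[a]          # importance sampling ratio
        F = gamma * rho_prev * F + 1.0        # followon trace update
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * w @ x_next - w @ x
        w += alpha * F * rho * delta * x      # emphasis-weighted TD(0) update
        s, rho_prev = s_next, rho
    return w
```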
Counterexample revisited with emphatic TD(0): the same $w \to 2w$ problem, with $\lambda = 0$, $\mu(\text{right}|\cdot) = 0.5$, $\pi(\text{right}|\cdot) = 1$, and $P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$.

Here $[f]_1 = 0.5$ and
$$[f]_2 = 0.5 + 0.9 + 0.9^2 + 0.9^3 + \cdots = 0.5 + 0.9\cdot 10 = 9.5,$$
so the key matrix is
$$F(I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 9.5 \end{pmatrix}\begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.95 \end{pmatrix},$$
and both of its columns now sum to $0.5 > 0$, as the general argument guarantees.
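Again the numbers can be checked directly; the final quadratic form is computed here rather than read off the slide, and it comes out positive:

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])

# f = (I - gamma * P_pi^T)^{-1} d_mu  ->  [0.5, 9.5]
f = np.linalg.solve(np.eye(2) - gamma * P_pi.T, d_mu)
key = np.diag(f) @ (np.eye(2) - gamma * P_pi)

print(f)                    # [0.5, 9.5]
print(key)                  # [[0.5, -0.45], [0., 0.95]]
print(key.sum(axis=0))      # both column sums are 0.5 > 0
print(X.T @ key @ X)        # positive -> emphatic TD(0) is stable here
```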
[Plots: weight $w$ versus steps for emphatic TD(0) on the $w \to 2w$ problems, with $\pi(\text{right}|\cdot) = 1$, $\mu(\text{right}|\cdot) = 0.5$ or $0.1$, and $\gamma = 0.9$ or $\gamma = 0$.]
Figure 3: Emphatic TD approaches the correct value of zero, whereas conventional off-policy TD diverges, on fifty trajectories on the $w \to 2w$ problems shown above each graph. Also shown as a thick line is the trajectory of the deterministic expected-update algorithm. On the continuing problem (left), emphatic TD has …
[Diagram: the 5-state problem from Section 5, with $\mu(\text{left}|\cdot) = 2/3$ and $\pi(\text{right}|\cdot) = 1$; rewards of 1 on each transition; approximate values built from shared weights $w_1, w_2, w_3$ (including $w_1{+}w_2$ and $w_2{+}w_3$); per-state discounts $\gamma = 1, 1, 1, 0, 0$; true values $v_\pi = 1, 1, 2, 3, 4$.]
Figure 4: Twenty learning curves and their analytic expectation on the 5-state problem from Section 5, in which excursions terminate promptly and both algorithms converge reliably. Here $\lambda = 0$, $w_0 = 0$, $\alpha = 0.001$, and $i(s) = 1, \forall s$. The MSVE performance measure is defined in (20):
$$\mathrm{MSVE}(w) \doteq \sum_{s\in\mathcal{S}} d_\mu(s)\, i(s)\left(v_\pi(s) - w^\top x(s)\right)^2.$$
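When the true values are known, this measure is straightforward to compute; a small sketch follows (the array names are assumptions, each indexed by state):

```python
import numpy as np

def msve(w, X, v_pi, d_mu, interest):
    """MSVE(w) = sum_s d_mu(s) * i(s) * (v_pi(s) - w^T x(s))^2.

    X is the |S| x n feature matrix, v_pi the true values, d_mu the behavior
    policy's stationary distribution, and interest the per-state interest i(s).
    """
    errors = v_pi - X @ w
    return float(np.sum(d_mu * interest * errors ** 2))
```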
The result is a TD algorithm with linear function approximation that is stable under off-policy training.