An Emphatic Approach to the Problem of Off-policy TD Learning
Rich Sutton, Rupam Mahmood, Martha White
Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Canada
Temporal-Difference Learning with Linear Function Approximation

- states $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$
- return $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, with $0 \le \gamma < 1$
- policy $\pi(a|s) \doteq \mathbb{P}\{A_t\!=\!a \mid S_t\!=\!s\}$
- value function $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t\!=\!s] \approx w_t^\top x(s)$
- feature vectors $x(s) \in \mathbb{R}^n$, $\forall s \in \mathcal{S}$
- weight vector $w_t \in \mathbb{R}^n$, with $n \ll |\mathcal{S}|$

linear TD(0):
$$w_{t+1} \doteq w_t + \alpha\left(R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t)\right)x(S_t)
= w_t + \alpha\Big(\underbrace{R_{t+1}x(S_t)}_{b_t \in \mathbb{R}^n} - \underbrace{x(S_t)\left(x(S_t) - \gamma x(S_{t+1})\right)^\top}_{A_t \in \mathbb{R}^{n\times n}}\, w_t\Big)
= (I - \alpha A_t)w_t + \alpha b_t.$$

deterministic 'expected' update:
$$\bar{w}_{t+1} \doteq (I - \alpha A)\bar{w}_t + \alpha b.$$

Stable if $A$ is positive definite, i.e., if $y^\top A y > 0,\ \forall y \neq 0$. Converges to $\lim_{t\to\infty}\bar{w}_t = A^{-1}b$.

$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[x(S_t)\left(x(S_t) - \gamma x(S_{t+1})\right)^\top\right]
= \sum_s d_\pi(s)\, x(s)\Big(x(s) - \gamma\sum_{s'}[P_\pi]_{ss'}\,x(s')\Big)^{\!\top}
= X^\top D_\pi(I - \gamma P_\pi)X,$$

where
- $X$ is the $|\mathcal{S}| \times n$ matrix with the feature vectors $x(s)^\top$ as its rows, and $D_\pi \doteq \mathrm{diag}(d_\pi)$;
- $P_\pi$ is the transition probability matrix, $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\,p(j|i,a)$, where $p(j|i,a) \doteq \mathbb{P}\{S_{t+1}\!=\!j \mid S_t\!=\!i, A_t\!=\!a\}$;
- $d_\pi$ is the ergodic stationary distribution, $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty}\mathbb{P}\{S_t\!=\!s\} > 0$, whose defining property is $P_\pi^\top d_\pi = d_\pi$.

If this "key matrix" $D_\pi(I - \gamma P_\pi)$ is positive definite, then $A$ is positive definite and everything is stable. I showed in 1988 that the key matrix is positive definite if its column sums are $> 0$.
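As a concrete illustration of the update above, here is a minimal NumPy sketch of linear TD(0); the environment interface (env.reset, env.step) and the feature map phi are hypothetical stand-ins, not anything defined in the talk.

```python
import numpy as np

def linear_td0(env, phi, n_features, alpha=0.01, gamma=0.9, num_steps=10_000):
    """Minimal sketch of on-policy linear TD(0).

    Assumed (illustrative) interface: env.reset() -> state,
    env.step(state) -> (reward, next_state) with actions chosen internally
    by the policy pi, and phi(state) -> feature vector x(s) of length n_features.
    """
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        r, s_next = env.step(s)
        x, x_next = phi(s), phi(s_next)
        # TD error: R_{t+1} + gamma * w^T x(S_{t+1}) - w^T x(S_t)
        delta = r + gamma * w @ x_next - w @ x
        # Semi-gradient TD(0) update: w <- w + alpha * delta * x(S_t)
        w += alpha * delta * x
        s = s_next
    return w
```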
For the $j$th column of the key matrix, the sum is
$$\sum_i [D_\pi(I-\gamma P_\pi)]_{ij}
= \sum_i\sum_k [D_\pi]_{ik}[I-\gamma P_\pi]_{kj}
= \sum_i [D_\pi]_{ii}[I-\gamma P_\pi]_{ij}
= \sum_i d_\pi(i)[I-\gamma P_\pi]_{ij}$$
$$= \left[d_\pi^\top(I-\gamma P_\pi)\right]_j
= \left[d_\pi^\top - \gamma d_\pi^\top P_\pi\right]_j
= \left[d_\pi^\top - \gamma d_\pi^\top\right]_j
= (1-\gamma)\,d_\pi(j) > 0,$$
using the stationarity property $d_\pi^\top P_\pi = d_\pi^\top$.
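The column-sum result is easy to check numerically. The sketch below builds the key matrix $D_\pi(I-\gamma P_\pi)$ for a made-up 3-state chain (the transition matrix below is illustrative only, not from the talk) and confirms that its column sums equal $(1-\gamma)d_\pi > 0$.

```python
import numpy as np

# A made-up 3-state Markov chain under the target policy (rows sum to 1).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
gamma = 0.9

# Stationary distribution d_pi: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmax(np.real(evals))])
d_pi = d_pi / d_pi.sum()

key = np.diag(d_pi) @ (np.eye(3) - gamma * P_pi)

print(key.sum(axis=0))        # column sums of the key matrix
print((1 - gamma) * d_pi)     # matches: (1 - gamma) * d_pi, all > 0
```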
- solution: importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ)
- solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…
Off-policy Temporal-Difference Learning with Linear Function Approximation

- states $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$
- the target policy $\pi(a|s)$ is no longer used to select actions; a behavior policy $\mu(a|s)$ is used to select actions!
- assume coverage: $\pi(a|s) > 0 \implies \mu(a|s) > 0$, $\forall s, a$
- new ergodic stationary distribution: $[d_\mu]_s \doteq \lim_{t\to\infty}\mathbb{P}\{S_t\!=\!s\} > 0$, $\forall s \in \mathcal{S}$
- value function as before: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t\!=\!s] \approx w_t^\top x(s)$
- importance sampling ratio: $\rho_t \doteq \dfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$

$$\mathbb{E}_\mu[\rho_t \mid S_t\!=\!s] = \sum_a \mu(a|s)\frac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1.$$

For any random variable $Z_{t+1}$:
$$\mathbb{E}_\mu[\rho_t Z_{t+1} \mid S_t\!=\!s] = \sum_a \mu(a|s)\frac{\pi(a|s)}{\mu(a|s)}\,Z_{t+1} = \sum_a \pi(a|s)\,Z_{t+1} = \mathbb{E}_\pi[Z_{t+1} \mid S_t\!=\!s].$$

linear off-policy TD(0) (with $x_t \doteq x(S_t)$):
$$w_{t+1} \doteq w_t + \rho_t\,\alpha\left(R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\right)x_t
= w_t + \alpha\Big(\underbrace{\rho_t R_{t+1} x_t}_{b_t} - \underbrace{\rho_t x_t(x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t\Big),$$

and its $A$ matrix:
$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\mu\!\left[\rho_t x_t(x_t - \gamma x_{t+1})^\top\right]
= \sum_s d_\mu(s)\,\mathbb{E}_\mu\!\left[\rho_t x_t(x_t - \gamma x_{t+1})^\top \,\middle|\, S_t\!=\!s\right]$$
$$= \sum_s d_\mu(s)\,\mathbb{E}_\pi\!\left[x_t(x_t - \gamma x_{t+1})^\top \,\middle|\, S_t\!=\!s\right]
= \sum_s d_\mu(s)\, x(s)\Big(x(s) - \gamma\sum_{s'}[P_\pi]_{ss'}\,x(s')\Big)^{\!\top}
= X^\top D_\mu(I - \gamma P_\pi)X.$$
The key matrix now has mismatched $D$ and $P$ matrices ($D_\mu$ paired with $P_\pi$); it is not guaranteed to be positive definite, and off-policy TD(0) is not stable. In particular, a column of the key matrix can sum to $< 0$, as in the counterexample below.
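For concreteness, here is a minimal sketch of linear off-policy TD(0) with per-step importance sampling ratios, in the same hypothetical NumPy interface as before (env, phi, and the policy-probability functions pi_probs and mu_probs are stand-ins, not defined in the talk).

```python
import numpy as np

def off_policy_linear_td0(env, phi, pi_probs, mu_probs, n_features,
                          alpha=0.01, gamma=0.9, num_steps=10_000, rng=None):
    """Sketch of linear off-policy TD(0) with importance sampling ratios.

    Assumed (illustrative) interface: pi_probs(s) and mu_probs(s) return
    action-probability vectors for the target and behavior policies,
    env.step(s, a) -> (reward, next_state), and phi(s) -> feature vector x(s).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        mu = mu_probs(s)
        a = rng.choice(len(mu), p=mu)            # behavior policy selects the action
        r, s_next = env.step(s, a)
        rho = pi_probs(s)[a] / mu[a]             # rho_t = pi(A_t|S_t) / mu(A_t|S_t)
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * w @ x_next - w @ x   # TD error
        w += alpha * rho * delta * x             # rho-corrected semi-gradient update
        s = s_next
    return w
```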
Counterexample: the $w \to 2w$ problem, with $\gamma = 0.9$, $\lambda = 0$, $\mu(\text{right}|\cdot) = 0.5$, $\pi(\text{right}|\cdot) = 1$.

Two states whose estimated values are $w$ and $2w$, i.e., single features $X = \begin{pmatrix}1\\2\end{pmatrix}$, and transition probability matrix
$$P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$

The key matrix is
$$D_\mu(I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}\begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix},$$
whose second column sums to $-0.4 < 0$!

pos. def. test:
$$X^\top D_\mu(I - \gamma P_\pi)X = \begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} -0.4 \\ 0.1 \end{pmatrix} = -0.2 < 0.$$
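The slide's arithmetic can be verified directly with NumPy, using only the numbers given above:

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])              # features: estimated values w and 2w
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])             # target policy always goes right
D_mu = np.diag([0.5, 0.5])                # behavior policy visits both states equally

key = D_mu @ (np.eye(2) - gamma * P_pi)
print(key)                  # [[0.5, -0.45], [0., 0.05]]
print(key.sum(axis=0))      # second column sums to -0.4 < 0
print(X.T @ key @ X)        # [[-0.2]]  -> not positive definite
```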
[Figure (after Van Roy, 2009): the true value function $v_\pi$ (denoted $J^*$) versus the approximate solution $\hat v$ (denoted $\tilde J_r$).]
Ideas leading to emphasis:
- …importance sampling to warp the state distribution from the behavior policy's distribution to the target policy's distribution, then did a future-reweighted update at each state
- …beginning of time
- …policy forever, and are in its stationary distribution
- …for a limited time with various target policies (e.g., options)
- …in the behavior distribution
- …and MSPBE
- …updated states from the behavior policy's stationary distribution to something like the 'followon distribution' of the target policy started in the behavior policy's stationary distribution
- …in expectation; this follows from old results (Dayan 1992, Sutton 1988) on convergence of TD(λ) in episodic MDPs
Emphatic TD(0)

Introduces a new short-term memory random variable, the followon trace:
$$F_0 \doteq 0, \qquad F_t \doteq \gamma\rho_{t-1}F_{t-1} + 1, \quad \forall t > 0.$$

Emphatic TD(0):
$$w_{t+1} \doteq w_t + \alpha F_t\rho_t\left(R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\right)x_t
= w_t + \alpha\Big(\underbrace{F_t\rho_t R_{t+1} x_t}_{b_t} - \underbrace{F_t\rho_t x_t(x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t\Big),$$

with
$$A = \lim_{t\to\infty}\mathbb{E}[A_t] = \lim_{t\to\infty}\mathbb{E}_\mu\!\left[F_t\rho_t x_t(x_t - \gamma x_{t+1})^\top\right] = X^\top F(I - \gamma P_\pi)X,$$
where the key matrix is now $F(I - \gamma P_\pi)$, with $F \doteq \mathrm{diag}(f)$ and $[f]_s \doteq d_\mu(s)\lim_{t\to\infty}\mathbb{E}_\mu[F_t \mid S_t\!=\!s]$, for which we have:
$$f = d_\mu + \gamma P_\pi^\top d_\mu + \left(\gamma P_\pi^\top\right)^2 d_\mu + \cdots = \left(I - \gamma P_\pi^\top\right)^{-1} d_\mu.$$
Sum of the $j$th column of the key matrix:
$$\sum_i [F(I-\gamma P_\pi)]_{ij}
= \sum_i\sum_k [F]_{ik}[I-\gamma P_\pi]_{kj}
= \sum_i [F]_{ii}[I-\gamma P_\pi]_{ij}
= \sum_i [f]_i[I-\gamma P_\pi]_{ij}$$
$$= \left[f^\top(I-\gamma P_\pi)\right]_j
= \left[d_\mu^\top(I-\gamma P_\pi)^{-1}(I-\gamma P_\pi)\right]_j
= \left[d_\mu^\top\right]_j
= d_\mu(j) > 0.$$

Every column of the key matrix sums to $> 0$.
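Here is a minimal sketch of emphatic TD(0) with the followon trace, again using the hypothetical env/phi/pi_probs/mu_probs interface from the earlier sketches:

```python
import numpy as np

def emphatic_td0(env, phi, pi_probs, mu_probs, n_features,
                 alpha=0.01, gamma=0.9, num_steps=10_000, rng=None):
    """Sketch of emphatic TD(0): off-policy TD(0) whose updates are scaled
    by the followon trace F_t = gamma * rho_{t-1} * F_{t-1} + 1."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    F, rho_prev = 0.0, 1.0                    # followon trace (F_0 = 0 per the slide)
    s = env.reset()
    for _ in range(num_steps):
        mu = mu_probs(s)
        a = rng.choice(len(mu), p=mu)         # behavior policy selects the action
        r, s_next = env.step(s, a)
        rho = pi_probs(s)[a] / mu[a]          # importance sampling ratio
        F = gamma * rho_prev * F + 1.0        # followon trace update
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * w @ x_next - w @ x
        w += alpha * F * rho * delta * x      # emphasis-weighted TD(0) update
        s, rho_prev = s_next, rho
    return w
```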
Counterexample revisited with emphatic TD(0): the same $w \to 2w$ problem, with $\lambda = 0$, $\mu(\text{right}|\cdot) = 0.5$, $\pi(\text{right}|\cdot) = 1$, and $P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$.

Here $[f]_1 = 0.5$ and
$$[f]_2 = 0.5 + 0.9 + 0.9^2 + 0.9^3 + \cdots = 0.5 + 0.9\cdot 10 = 9.5,$$
so the key matrix is
$$F(I - \gamma P_\pi) = \begin{pmatrix} 0.5 & 0 \\ 0 & 9.5 \end{pmatrix}\begin{pmatrix} 1 & -0.9 \\ 0 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.45 \\ 0 & 0.95 \end{pmatrix},$$
and both of its columns now sum to $0.5 > 0$, as the general argument guarantees.
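Again the numbers can be checked directly; the final quadratic form is computed here rather than read off the slide, and it comes out positive:

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])

# f = (I - gamma * P_pi^T)^{-1} d_mu  ->  [0.5, 9.5]
f = np.linalg.solve(np.eye(2) - gamma * P_pi.T, d_mu)
key = np.diag(f) @ (np.eye(2) - gamma * P_pi)

print(f)                    # [0.5, 9.5]
print(key)                  # [[0.5, -0.45], [0., 0.95]]
print(key.sum(axis=0))      # both column sums are 0.5 > 0
print(X.T @ key @ X)        # positive -> emphatic TD(0) is stable here
```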
[Plots: weight $w$ versus steps for emphatic TD(0) on the $w \to 2w$ problems, with $\pi(\text{right}|\cdot) = 1$, $\mu(\text{right}|\cdot) = 0.5$ or $0.1$, and $\gamma = 0.9$ or $\gamma = 0$.]
Figure 3: Emphatic TD approaches the correct value of zero, whereas conventional off-policy TD diverges, on fifty trajectories on the $w \to 2w$ problems shown above each graph. Also shown as a thick line is the trajectory of the deterministic expected-update algorithm. On the continuing problem (left), emphatic TD has …
[Diagram: the 5-state problem from Section 5, with $\mu(\text{left}|\cdot) = 2/3$ and $\pi(\text{right}|\cdot) = 1$; rewards of 1 on each transition; approximate values built from shared weights $w_1, w_2, w_3$ (including $w_1{+}w_2$ and $w_2{+}w_3$); per-state discounts $\gamma = 1, 1, 1, 0, 0$; true values $v_\pi = 1, 1, 2, 3, 4$.]
Figure 4: Twenty learning curves and their analytic expectation on the 5-state problem from Section 5, in which excursions terminate promptly and both algorithms converge reliably. Here $\lambda = 0$, $w_0 = 0$, $\alpha = 0.001$, and $i(s) = 1, \forall s$. The MSVE performance measure is defined in (20):
$$\mathrm{MSVE}(w) \doteq \sum_{s\in\mathcal{S}} d_\mu(s)\, i(s)\left(v_\pi(s) - w^\top x(s)\right)^2.$$
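When the true values are known, this measure is straightforward to compute; a small sketch follows (the array names are assumptions, each indexed by state):

```python
import numpy as np

def msve(w, X, v_pi, d_mu, interest):
    """MSVE(w) = sum_s d_mu(s) * i(s) * (v_pi(s) - w^T x(s))^2.

    X is the |S| x n feature matrix, v_pi the true values, d_mu the behavior
    policy's stationary distribution, and interest the per-state interest i(s).
    """
    errors = v_pi - X @ w
    return float(np.sum(d_mu * interest * errors ** 2))
```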
The result is a TD algorithm with linear function approximation that is stable under off-policy training.