
CS234 Notes - Lecture 8 & 9 Policy Gradient

Luke Johnston, Emma Brunskill
March 20, 2018

1 Introduction to Policy Search

So far, in order to learn a policy, we have focused on value-based approaches where we find the optimal state value function or state-action value function with parameters $\theta$,

$$V_\theta(s) \approx V^\pi(s), \qquad Q_\theta(s, a) \approx Q^\pi(s, a),$$

and then use $V_\theta$ or $Q_\theta$ to extract a policy, e.g. with $\epsilon$-greedy. However, we can also use a policy-based approach to directly parameterize the policy:

$$\pi_\theta(a \mid s) = P[a \mid s; \theta]$$

In this setting, our goal is to directly find the policy with the highest value function $V^\pi$, rather than first finding the value function of the optimal policy and then extracting the policy from it. Instead of the policy being a look-up table from states to actions, we will consider stochastic policies that are parameterized. Finding a good policy requires two parts:

  • Good policy parameterization: our function approximation and state/action representations must be expressive enough
  • Effective search: we must be able to find good parameters for our policy function approximation

Policy-based RL has a few advantages over value-based RL:

  • Better convergence properties (see Ch 13.3 of Sutton and Barto)
  • Effectiveness in high-dimensional or continuous action spaces, e.g. robotics. One method for continuous action spaces is covered in section 6.2.
  • Ability to learn stochastic policies. See the following section.

The disadvantages of policy-based RL methods are:

  • They typically converge to locally rather than globally optimal policies, since they rely on gradient descent.
  • Evaluating a policy is typically data inefficient and high variance.


2 Stochastic policies

In this section, we will briefly go over two environments in which a stochastic policy is better than any deterministic policy.

2.1 Rock-paper-scissors

For a relatable example, in the popular zero-sum game of rock-paper-scissors, any policy that is not uniformly random,

$$P(\text{rock} \mid s) = 1/3, \qquad P(\text{scissors} \mid s) = 1/3, \qquad P(\text{paper} \mid s) = 1/3,$$

can be exploited.
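To make "can be exploited" concrete, the following is a minimal sketch that computes the best expected payoff an opponent can obtain against a fixed rock-paper-scissors policy. The function names and the +1/0/-1 payoff convention are assumptions made for this illustration, not part of the notes.

```python
MOVES = ("rock", "paper", "scissors")
# BEATS[x] is the move that x defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(opponent_move, our_move):
    """Payoff to the opponent: +1 if they win, 0 on a tie, -1 if they lose."""
    if opponent_move == our_move:
        return 0
    return 1 if BEATS[opponent_move] == our_move else -1


def exploitability(policy):
    """Best expected payoff an opponent can get against `policy`.

    `policy` maps each move to the probability that we play it.
    """
    return max(
        sum(policy[ours] * payoff(opp, ours) for ours in MOVES)
        for opp in MOVES
    )


uniform = {"rock": 1 / 3, "paper": 1 / 3, "scissors": 1 / 3}
biased = {"rock": 0.5, "paper": 0.25, "scissors": 0.25}
print(exploitability(uniform), exploitability(biased))
```

Under these assumptions the uniform policy gives the opponent an expected payoff of zero whatever they play, while the rock-heavy policy can be beaten on average by always playing paper.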

2.2 Aliased gridworld

Figure 1: In this partially observable gridworld environment, the agent cannot distinguish between the gray states.

In the above gridworld environment, suppose that the agent can move in the four cardinal directions, so its action space is $A = \{N, S, E, W\}$. However, suppose that it can only sense the walls around its current location. Specifically, it observes features of the following form for each direction:

$$\phi(s) = \begin{bmatrix} \mathbb{1}(\text{wall to N}) \\ \vdots \\ \mathbb{1}(\text{wall to W}) \end{bmatrix}$$

Note that its observations are not fully representative of the environment, as it cannot distinguish between the two gray squares. This also means that its domain is not Markov. Hence, a deterministic policy must either learn to always go left in the gray squares, or always go right. Neither of these policies is optimal, since the agent can get stuck in one corner of the environment:

Figure 2: For this deterministic policy, the agent cannot "escape" from the upper-left two states.

However, a stochastic policy can learn to randomly select a direction in the gray states, guaranteeing that it will eventually reach the reward from any starting location. In general, stochastic policies can help overcome an adversarial or non-stationary domain and cases where the state representation is not Markov.

Figure 3: A stochastic policy which moves E or W with equal probability in the gray states will reach the goal in a few time steps with high probability.

3 Policy optimization

3.1 Policy objective functions

Once we have defined a policy $\pi_\theta(a \mid s)$, we need to be able to measure how it is performing in order to optimize it. In an episodic environment, a natural measurement is the start value of the policy, which is the expected value of the start state:

$$J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$$

In continuing environments we can use the average value of the policy, where $d^{\pi_\theta}(s)$ is the stationary distribution of $\pi_\theta$:

$$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$

or alternatively we can use the average reward per time-step:

$$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, R(s, a)$$

In these notes we discuss the episodic case, but all the results we derive can be easily extended to the non-episodic case. We will also focus on the case where the discount $\gamma = 1$, but again, the results are easily extended to general $\gamma$.
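As a rough illustration of the average-reward objective $J_{avR}$, the sketch below evaluates it for a small tabular problem. The two-state environment, the power-iteration routine, and all names are hypothetical choices made for this example; it assumes the chain induced by the policy has a unique stationary distribution that power iteration reaches.

```python
def stationary_distribution(P_pi, iters=1000):
    """Approximate d(s) by repeatedly applying the state-transition matrix."""
    n = len(P_pi)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def average_reward(policy, P, R):
    """J_avR = sum_s d(s) sum_a pi(a|s) R(s, a).

    policy[s][a] = pi(a|s), P[s][a][s2] = P(s2 | s, a), R[s][a] = reward.
    """
    n_s, n_a = len(policy), len(policy[0])
    # Transition matrix of the Markov chain induced by the policy.
    P_pi = [
        [sum(policy[s][a] * P[s][a][s2] for a in range(n_a)) for s2 in range(n_s)]
        for s in range(n_s)
    ]
    d = stationary_distribution(P_pi)
    return sum(
        d[s] * sum(policy[s][a] * R[s][a] for a in range(n_a))
        for s in range(n_s)
    )


# Hypothetical environment: action 0 stays put, action 1 switches state;
# being in state 1 pays reward 1, being in state 0 pays 0.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
uniform_policy = [[0.5, 0.5], [0.5, 0.5]]
sticky_policy = [[0.0, 1.0], [0.9, 0.1]]  # leave state 0, mostly stay in state 1
print(average_reward(uniform_policy, P, R), average_reward(sticky_policy, P, R))
```

In this toy setting the policy that prefers to stay in the rewarding state spends about 10/11 of its time there, so its average reward is correspondingly higher than that of the uniform policy.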

3.2 Optimization methods

With an objective function, we can treat our policy-based reinforcement learning problem as an optimization problem. In these notes, we focus on gradient descent, because recently that has been the most common optimization method for policy-based RL methods. However, it is worth considering some gradient-free optimization methods, including the following:

  • Hill climbing
  • Simplex / amoeba / Nelder Mead
  • Genetic algorithms
  • Cross-Entropy method (CEM)
  • Covariance Matrix Adaptation (CMA)
  • Evolution strategies

These methods have the advantage over gradient-based methods in that they do not have to compute a gradient of the objective function. This allows the policy parameterization to be non-differentiable, and these methods are also often easy to parallelize. Gradient-free methods are often a useful baseline to try, and sometimes they can work embarrassingly well [1]. However, these methods are usually not very sample efficient because they ignore the temporal structure of the rewards: updates only take into account the total reward over the entire episode, and they do not break up the reward into different rewards for each state in the trajectory (see section 4.3).
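As one possible illustration of a gradient-free method, here is a minimal hill-climbing sketch. It only ever queries the scalar objective, which is the point made above. The `toy_objective` function is a hypothetical stand-in for an estimate of the total episode reward; the step size and iteration count are arbitrary assumptions.

```python
import random


def hill_climb(objective, theta, n_iters=200, step=0.1, seed=0):
    """Random-perturbation hill climbing: keep a candidate only if it scores higher."""
    rng = random.Random(seed)
    best = objective(theta)
    for _ in range(n_iters):
        candidate = [t + rng.gauss(0.0, step) for t in theta]
        value = objective(candidate)
        if value > best:
            theta, best = candidate, value
    return theta, best


def toy_objective(theta):
    # Hypothetical stand-in for an episode-return estimate; maximized at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


theta, best = hill_climb(toy_objective, [0.0, 0.0])
print(theta, best)
```

No gradient of `toy_objective` is ever taken, so the same loop would also work with a non-differentiable policy parameterization. In a real RL setting each call to the objective would require running one or more full episodes, which is where the sample inefficiency comes from.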


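4 Policy gradient

Let us define $V(\theta)$ to be the objective function we wish to maximize over $\theta$. Policy gradient methods search for a local maximum in $V(\theta)$ by ascending the gradient of the policy, w.r.t. parameters $\theta$,

$$\Delta\theta = \alpha \nabla_\theta V(\theta),$$

where $\alpha$ is a step-size parameter and $\nabla_\theta V(\theta)$ is the policy gradient

$$\nabla_\theta V(\theta) = \begin{pmatrix} \dfrac{\partial V(\theta)}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial V(\theta)}{\partial \theta_n} \end{pmatrix}$$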

4.1 Computing the gradient

With this setup, all we have to do is compute the gradient of the objective function $V(\theta)$, and we can optimize it! The method of finite differences from calculus provides an approximation of the gradient:

$$\frac{\partial V(\theta)}{\partial \theta_k} \approx \frac{V(\theta + \epsilon u_k) - V(\theta)}{\epsilon}$$

where $u_k$ is a unit vector with 1 in the $k$th component, 0 elsewhere. This method uses $n$ evaluations to compute the policy gradient in $n$ dimensions, so it is quite inefficient, and it usually only provides a noisy approximation of the true policy gradient. However, it has the advantage that it works for non-differentiable policies. An example of a successful use of this method to train the AIBO robot gait can be found in [2].

4.1.1 Analytic gradients

Let us set the objective function $V(\theta)$ to be the expected rewards for an episode,

$$V(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\left[\sum_{t=0}^{T-1} R(s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau P(\tau; \theta) R(\tau)$$

where $\tau$ is a trajectory,

$$\tau = (s_0, a_0, r_0, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T),$$

$P(\tau; \theta)$ denotes the probability over trajectories when following policy $\pi_\theta$, and $R(\tau)$ is the sum of rewards for a trajectory. Note that this objective function is the same as the start value $J_1(\theta)$ as mentioned in section 3.1 when the discount $\gamma = 1$. If we can mathematically compute the policy gradient $\nabla_\theta \pi_\theta(a \mid s)$, then we can go right ahead and compute the gradient of this objective function with respect to $\theta$:

\begin{align}
\nabla_\theta V(\theta) &= \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau) \tag{1} \\
&= \sum_\tau \nabla_\theta P(\tau; \theta) R(\tau) \tag{2} \\
&= \sum_\tau \frac{P(\tau; \theta)}{P(\tau; \theta)} \nabla_\theta P(\tau; \theta) R(\tau) \tag{3} \\
&= \sum_\tau P(\tau; \theta) R(\tau) \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \tag{4} \\
&= \sum_\tau P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta) \tag{5} \\
&= \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \nabla_\theta \log P(\tau; \theta)] \tag{6}
\end{align}

The expression $\frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}$ in equation (4) is known as the likelihood ratio. The trick in steps (3)-(6) helps for two reasons. First, it helps us get the gradient into the form $\mathbb{E}_{\tau \sim \pi_\theta}[\dots]$, which allows us to approximate the gradient by sampling trajectories $\tau^{(i)}$:

$$\nabla_\theta V(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta) \tag{7}$$

Second, computing $\nabla_\theta \log P(\tau^{(i)}; \theta)$ is easier than working with $P(\tau^{(i)}; \theta)$ directly:

\begin{align}
\nabla_\theta \log P(\tau^{(i)}; \theta) &= \nabla_\theta \log \Bigg[ \underbrace{\mu(s_0)}_{\text{initial state distribution}} \prod_{t=0}^{T-1} \underbrace{\pi_\theta(a_t \mid s_t)}_{\text{policy}} \underbrace{P(s_{t+1} \mid s_t, a_t)}_{\text{dynamics model}} \Bigg] \tag{8} \\
&= \nabla_\theta \Bigg[ \log \mu(s_0) + \sum_{t=0}^{T-1} \Big( \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \Big) \Bigg] \tag{9} \\
&= \underbrace{\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{no dynamics model required!}} \tag{10}
\end{align}

Working with $\log P(\tau^{(i)}; \theta)$ instead of $P(\tau^{(i)}; \theta)$ allows us to represent the gradient without reference to the initial state distribution, or even the environment dynamics model! The expression $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is known as the score function. Putting equations (7) and (10) together, we get

$$\nabla_\theta V(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$

which we can convert into a concrete algorithm for optimizing $\pi_\theta$ (section 5). But before that, we will mention the generalized version of this result and cover an optimization of the above derivation that takes advantage of decomposing $R(\tau^{(i)})$ into a sum of reward terms $r_t^{(i)}$ (section 4.3).
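The two gradient estimates in section 4.1 can be compared on a tiny example. The sketch below assumes a hypothetical one-step problem with three actions, fixed rewards, and a softmax policy with one parameter per action, so a trajectory is a single action and $V(\theta)$ can be computed exactly. All names and constants are illustrative assumptions.

```python
import math
import random

REWARDS = [1.0, 0.0, 0.5]  # hypothetical reward for each of three actions


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def value(theta):
    """V(theta) = sum_a pi_theta(a) R(a) for the one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))


def finite_difference_gradient(theta, eps=1e-6):
    """Forward differences: one extra evaluation of V per parameter."""
    base = value(theta)
    grad = []
    for k in range(len(theta)):
        bumped = list(theta)
        bumped[k] += eps
        grad.append((value(bumped) - base) / eps)
    return grad


def score(theta, a):
    """grad_theta log pi_theta(a) for this softmax: one_hot(a) - pi."""
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]


def likelihood_ratio_gradient(theta, m=20000, seed=0):
    """Sample-based estimate as in equation (7): average of R(tau) * score."""
    rng = random.Random(seed)
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(m):
        a = rng.choices(range(len(theta)), weights=pi)[0]
        s = score(theta, a)
        for k in range(len(theta)):
            grad[k] += REWARDS[a] * s[k] / m
    return grad


def exact_gradient(theta):
    """Equation (6) with the expectation over actions computed exactly."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * REWARDS[a] * score(theta, a)[k] for a in range(n))
        for k in range(n)
    ]


theta = [0.2, -0.1, 0.0]
print(exact_gradient(theta))
print(finite_difference_gradient(theta))
print(likelihood_ratio_gradient(theta))
```

Note that the likelihood-ratio estimate never evaluates $V(\theta)$ and needs no model of how rewards arise; it only needs sampled actions, their rewards, and the score function.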


4.2 The policy gradient theorem

Theorem 4.1. For any differentiable policy $\pi_\theta(a \mid s)$ and for any of the policy objective functions $V(\theta) = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\left[Q^{\pi_\theta}(s, a) \cdot \nabla_\theta \log \pi_\theta(a \mid s)\right]$$

We will not go over the derivation of this more general theorem, but the same concepts discussed in this lecture apply to non-episodic (continuing) environments. In our discussion thus far, the total episode rewards $R(\tau)$ have been substituted in place of the $Q$ values of this theorem, but in the following section we will use the temporal structure to get our result into a form that looks more like this theorem, where the future returns $G_t$ (which are unbiased estimates of $Q(s_t, a_t)$) appear in place of $Q^{\pi_\theta}(s, a)$.

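4.3 Using temporal structure of rewards for the policy gradient

Equation (6) above can be written

$$\nabla_\theta V(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \tag{11}$$

Notice that the rewards $R(\tau^{(i)})$ are treated as a single number which is a function of an entire trajectory $\tau^{(i)}$. We can break this down into the sum of all the rewards encountered in the trajectory,

$$R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t)$$

Using this knowledge, we can derive the gradient estimate for a single reward term $r_{t'}$ in exactly the same way we derived equation (11):

$$\nabla_\theta \mathbb{E}_{\pi_\theta}[r_{t'}] = \mathbb{E}_{\pi_\theta}\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

Since $\sum_{t'=t}^{T-1} r_{t'}^{(i)}$ is the return $G_t^{(i)}$, we can sum this up over all time steps for a trajectory to get

\begin{align}
\nabla_\theta V(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\pi_\theta}\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \tag{12} \\
&= \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\right] \tag{13} \\
&= \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \tag{14}
\end{align}

Going from (12) to (13) may not be obvious, so let's go over a quick example. Say we have a trajectory that is three time steps long. Then equation (12) becomes

\begin{align}
\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\big[ & r_0 \nabla_\theta \log \pi_\theta(a_0 \mid s_0) \\
& + r_1 \left(\nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \nabla_\theta \log \pi_\theta(a_1 \mid s_1)\right) \\
& + r_2 \left(\nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \nabla_\theta \log \pi_\theta(a_1 \mid s_1) + \nabla_\theta \log \pi_\theta(a_2 \mid s_2)\right) \big]
\end{align}

Regrouping the terms, we get

\begin{align}
\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\big[ & \nabla_\theta \log \pi_\theta(a_0 \mid s_0)(r_0 + r_1 + r_2) \\
& + \nabla_\theta \log \pi_\theta(a_1 \mid s_1)(r_1 + r_2) \\
& + \nabla_\theta \log \pi_\theta(a_2 \mid s_2)(r_2) \big]
\end{align}

which equals equation (13) as expected. The main idea is that the policy's choice at a particular time step $t$ only affects rewards received in later steps of the episode, and has no effect on rewards received in previous time steps. Our original expression in equation (11) did not take this into account. Our final expression that we will use in the policy gradient algorithm in the next section is

$$\nabla_\theta V(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} G_t^{(i)} \cdot \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big) \tag{15}$$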
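The regrouping from (12) to (13) can also be checked numerically for a single trajectory. In the sketch below, each score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is replaced by an arbitrary scalar purely for illustration (a real score is a vector, but the identity holds component-wise); the function names are assumptions.

```python
def returns_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_{T-1} (undiscounted, gamma = 1)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G


def form_12(rewards, scores):
    """sum_{t'} r_{t'} * sum_{t <= t'} score_t, as inside equation (12)."""
    return sum(
        rewards[tp] * sum(scores[: tp + 1]) for tp in range(len(rewards))
    )


def form_13(rewards, scores):
    """sum_t score_t * G_t, as inside equations (13) and (14)."""
    G = returns_to_go(rewards)
    return sum(scores[t] * G[t] for t in range(len(rewards)))


rewards = [1.0, 0.0, 2.0]
scores = [0.5, -1.0, 0.25]  # stand-ins for one component of each score vector
print(returns_to_go(rewards), form_12(rewards, scores), form_13(rewards, scores))
```

Both forms give the same number for any rewards and scores, since they sum the same products $r_{t'} \cdot \text{score}_t$ over the pairs with $t \le t'$, just in a different order.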

5 REINFORCE: A Monte-Carlo policy gradient algorithm

We've done most of the work towards our first policy gradient algorithm in the sections above. The algorithm simply samples multiple trajectories following the policy $\pi_\theta$ while updating $\theta$ using the estimated gradient (15).

Algorithm 1 REINFORCE: Monte-Carlo policy gradient algorithm

    procedure REINFORCE($\alpha$)
        Initialize policy parameters $\theta$ arbitrarily
        for each episode $\{s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do
            for $t = 1$ to $T - 1$ do
                $\theta \leftarrow \theta + \alpha \cdot G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
        return $\theta$
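A minimal Python sketch of the REINFORCE update loop follows. The single-state environment (action 0 pays reward 1, action 1 pays 0), the tabular softmax policy with one preference per action, and all hyperparameters are hypothetical choices made only so the example is self-contained; they are not taken from the notes.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def run_episode(theta, horizon, rng):
    """Hypothetical single-state environment: action 0 pays 1, action 1 pays 0."""
    actions, rewards = [], []
    for _ in range(horizon):
        a = rng.choices([0, 1], weights=softmax(theta))[0]
        actions.append(a)
        rewards.append(1.0 if a == 0 else 0.0)
    return actions, rewards


def reinforce(alpha=0.1, n_episodes=500, horizon=1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one preference per action
    for _ in range(n_episodes):
        actions, rewards = run_episode(theta, horizon, rng)
        # Returns-to-go G_t for the sampled episode (gamma = 1).
        G, running = [0.0] * horizon, 0.0
        for t in reversed(range(horizon)):
            running += rewards[t]
            G[t] = running
        # theta <- theta + alpha * G_t * grad log pi(a_t | s_t)
        for t in range(horizon):
            pi = softmax(theta)
            for k in range(2):
                score_k = (1.0 if k == actions[t] else 0.0) - pi[k]
                theta[k] += alpha * G[t] * score_k
    return theta


theta = reinforce()
print(softmax(theta))
```

With the default one-step horizon, updates only occur when the rewarding action was sampled, so the probability of action 0 can only grow. For longer horizons the same code applies, but individual updates become noisier because $G_t$ also includes rewards earned by later actions.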

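6 Differentiable policy classes

6.1 Discrete action space: softmax policy

In discrete action spaces, the softmax function is commonly used to parameterize the policy:

$$\pi_\theta(a \mid s) = \frac{e^{\phi(s, a)^T \theta}}{\sum_{a'} e^{\phi(s, a')^T \theta}}$$

The score function is then

\begin{align}
\nabla_\theta \log \pi_\theta(a \mid s) &= \nabla_\theta \left[ \phi(s, a)^T \theta - \log \sum_{a'} e^{\phi(s, a')^T \theta} \right] \\
&= \phi(s, a) - \frac{1}{\sum_{a'} e^{\phi(s, a')^T \theta}} \nabla_\theta \sum_{a'} e^{\phi(s, a')^T \theta} \\
&= \phi(s, a) - \frac{1}{\sum_{a'} e^{\phi(s, a')^T \theta}} \sum_{a'} \phi(s, a') e^{\phi(s, a')^T \theta} \\
&= \phi(s, a) - \sum_{a'} \phi(s, a') \frac{e^{\phi(s, a')^T \theta}}{\sum_{a''} e^{\phi(s, a'')^T \theta}} \\
&= \phi(s, a) - \sum_{a'} \phi(s, a') \pi_\theta(a' \mid s) \\
&= \phi(s, a) - \mathbb{E}_{a' \sim \pi_\theta(a' \mid s)}[\phi(s, a')]
\end{align}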
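The closed form $\phi(s, a) - \mathbb{E}_{a'}[\phi(s, a')]$ can be sanity-checked against a numerical derivative of $\log \pi_\theta(a \mid s)$. The sketch below does this for one fixed state; the feature vectors and parameter values are arbitrary assumptions chosen for illustration.

```python
import math


def softmax_policy(theta, features):
    """pi_theta(. | s) where features[a] is the vector phi(s, a) for a fixed s."""
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_score(theta, features, a):
    """Analytic score: phi(s, a) minus the policy-weighted mean feature vector."""
    pi = softmax_policy(theta, features)
    expected_phi = [
        sum(pi[b] * features[b][k] for b in range(len(features)))
        for k in range(len(theta))
    ]
    return [features[a][k] - expected_phi[k] for k in range(len(theta))]


def numeric_score(theta, features, a, eps=1e-6):
    """Central-difference approximation of grad_theta log pi_theta(a | s)."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        diff = math.log(softmax_policy(up, features)[a]) - math.log(
            softmax_policy(down, features)[a]
        )
        grad.append(diff / (2 * eps))
    return grad


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # phi(s, a) for three actions
theta = [0.3, -0.2]
print(softmax_score(theta, features, 2))
print(numeric_score(theta, features, 2))
```

The two printed vectors should agree to several decimal places, which is a cheap way to catch sign or indexing mistakes when implementing a score function by hand.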

6.2 Continuous action space: Gaussian policy

For continuous action spaces, a common choice is a Gaussian policy $a \sim \mathcal{N}(\mu(s), \sigma^2)$.

  • The mean action is a linear combination of state features: $\mu(s) = \phi(s)^T \theta$
  • The variance $\sigma^2$ can be fixed, or also parameterized

The score function is

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}$$
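For the Gaussian policy with fixed variance, the score follows from differentiating the log-density $-\frac{(a - \phi(s)^T\theta)^2}{2\sigma^2} + \text{const}$ with respect to $\theta$. A small sketch under that fixed-variance assumption, with arbitrary illustrative numbers:

```python
import math


def gaussian_log_prob(theta, phi, a, sigma):
    """log N(a; mu, sigma^2) with mu = phi^T theta and a fixed sigma."""
    mu = sum(p * t for p, t in zip(phi, theta))
    return -0.5 * math.log(2 * math.pi * sigma**2) - (a - mu) ** 2 / (2 * sigma**2)


def gaussian_score(theta, phi, a, sigma):
    """grad_theta log pi_theta(a | s) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = sum(p * t for p, t in zip(phi, theta))
    return [(a - mu) * p / sigma**2 for p in phi]


theta = [0.5, -1.0]
phi = [1.0, 2.0]   # state features phi(s), hypothetical values
a, sigma = 0.3, 0.5
print(gaussian_score(theta, phi, a, sigma))
```

Here $\mu(s) = -1.5$, so the sampled action $a = 0.3$ lies above the mean and the score points in the direction that raises the mean, scaled by the features and by $1/\sigma^2$.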

7 Variance reduction with a baseline

A weakness of Monte-Carlo policy gradient algorithms is that the returns $G_t^{(i)}$ often have high variance across multiple episodes. One way to address this is to subtract a baseline $b(s)$ from each $G_t^{(i)}$. The baseline can be any function, as long as it does not vary with $a$.

$$\nabla_\theta V(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

First, why do we want to do this? Intuitively, we can think of $(G_t - b(s_t))$ as an estimate of how much better we did after time step $t$ than is expected by the baseline $b(s_t)$. So, if the baseline is approximately equal to the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \dots + r_{T-1}]$, then we will be increasing the log-probability of action $a_t$ proportionally to how much better the return $G_t$ is than expected. Previously, we were increasing the log-probability proportionally to the magnitude of $G_t$, so even if the policy always achieved exactly its expected returns, we would still be applying gradient updates that could cause it to diverge. The quantity $(G_t - b(s_t))$ is usually called the advantage, $A_t$. We can estimate the true advantage from a sampled trajectory $\tau^{(i)}$ with

$$\hat{A}_t = (G_t^{(i)} - b(s_t))$$

Secondly, why can we do this? It turns out that subtracting a baseline in this manner does not introduce any bias into the gradient calculation. $\mathbb{E}_\tau[b(s_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t)]$ evaluates to zero, and hence has no effect on the gradient update.

\begin{align}
\mathbb{E}_{\tau \sim \pi_\theta}&[b(s_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t)] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)]\right] && \text{(break up expectation)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\right] && \text{(pull baseline term out)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\right] && \text{(remove irrelevant variables)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_{a_t} \pi_\theta(a_t \mid s_t) \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right] && \text{(expand expectation + take derivative of logarithm)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_{a_t} \nabla_\theta \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \nabla_\theta 1\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \cdot 0\right] = 0
\end{align}
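The key step of this derivation, $\sum_{a} \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) = 0$, is easy to verify numerically. The sketch below assumes a softmax policy over three actions with one parameter per action (so the score is `one_hot(a) - pi`); the parameter values and baseline value are arbitrary.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_score(theta):
    """E_{a ~ pi}[grad_theta log pi(a)], computed exactly by summing over actions."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * ((1.0 if k == a else 0.0) - pi[k]) for a in range(n))
        for k in range(n)
    ]


def baseline_term(theta, b):
    """b(s) * E_a[score]: the contribution of a baseline to the expected gradient."""
    return [b * g for g in expected_score(theta)]


print(baseline_term([0.3, -1.2, 2.0], b=5.0))
```

Every component is zero up to floating-point rounding, regardless of the value chosen for the baseline, which is why the baseline changes the variance of the estimator but not its expectation.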

7.1 Vanilla policy gradient

Using the baseline as described above, we introduce the "vanilla" policy gradient algorithm. Suppose that the baseline function has parameters $w$.

Algorithm 2 Vanilla Policy Gradient Algorithm

    procedure Policy Gradient($\alpha$)
        Initialize policy parameters $\theta$ and baseline values $b(s)$ for all $s$, e.g. to 0
        for iteration $= 1, 2, \dots$ do
            Collect a set of $m$ trajectories by executing the current policy $\pi_\theta$
            for each time step $t$ of each trajectory $\tau^{(i)}$ do
                Compute the return $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}$
                Compute the advantage estimate $\hat{A}_t^{(i)} = G_t^{(i)} - b(s_t)$
            Re-fit the baseline to the empirical returns by updating $w$ to minimize
                $\sum_{i=1}^{m} \sum_{t=0}^{T-1} \| b(s_t) - G_t^{(i)} \|^2$
            Update policy parameters $\theta$ using the policy gradient estimate $\hat{g}$,
                $\hat{g} = \sum_{i=1}^{m} \sum_{t=0}^{T-1} \hat{A}_t^{(i)} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$,
            with an optimizer like SGD ($\theta \leftarrow \theta + \alpha \cdot \hat{g}$) or Adam
        return $\theta$ and baseline values $b(s)$

One natural choice for the baseline is the state value function, $b(s_t) = V(s_t)$. Under this formulation, we can define the advantage function as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. However, since we do not know the true state values, we instead use an estimate $\hat{V}(s_t; w)$ for some weight vector $w$. We can simultaneously learn the weight vector $w$ for the baseline (state-value) function and policy parameters $\theta$ using the Monte-Carlo trajectory samples.
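The return, advantage, and baseline re-fitting steps of Algorithm 2 can be sketched for the simplest case of a tabular baseline, where $b(s)$ has one free value per state. In that case the least-squares fit to the empirical returns is just the mean observed return for each state. The trajectory format (a list of states paired with a list of rewards) and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def returns_to_go(rewards):
    """G_t = sum of rewards from time t to the end of the episode."""
    G, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G


def advantages(states, rewards, baseline):
    """A_hat_t = G_t - b(s_t); states missing from the baseline default to 0."""
    return [g - baseline.get(s, 0.0) for s, g in zip(states, returns_to_go(rewards))]


def fit_tabular_baseline(trajectories):
    """Minimize sum (b(s_t) - G_t)^2 over a per-state table: the per-state mean return."""
    totals, counts = defaultdict(float), defaultdict(int)
    for states, rewards in trajectories:
        for s, g in zip(states, returns_to_go(rewards)):
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Two hypothetical sampled trajectories, each a (states, rewards) pair.
trajectories = [(["A", "B"], [1.0, 2.0]), (["A", "B"], [3.0, 0.0])]
baseline = fit_tabular_baseline(trajectories)
print(baseline)
print(advantages(["A", "B"], [1.0, 2.0], baseline))
```

With a parametric baseline $\hat{V}(s; w)$ the re-fitting step would instead be a regression on the same (state, return) pairs; the advantage computation is unchanged.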


Note that in the above algorithm, we usually do not compute the gradients $\sum_t \hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ individually. Rather, we accumulate data from a batch into a loss function

$$L(\theta) = \sum_t \hat{A}_t \log \pi_\theta(a_t \mid s_t)$$

and then apply the gradients all at once by computing $\nabla_\theta L(\theta)$. We can also introduce a component into this loss to fit the baseline function:

$$L(\theta, w) = \sum_t \left( \hat{A}_t \log \pi_\theta(a_t \mid s_t) - \left\| b(s_t) - G_t^{(i)} \right\|^2 \right)$$

We can then compute the gradients of $L(\theta, w)$ w.r.t. $\theta$ and $w$ to perform SGD updates.

7.2 N-step estimators

In the above derivations, we have used the Monte-Carlo estimates of the reward in the policy gradient approximation. However, if we have access to a value function (for example, the baseline), then we can also use TD methods for the policy gradient update, or any intermediate blend between TD and MC methods:

\begin{align}
\hat{G}_t^{(1)} &= r_t + \gamma V(s_{t+1}) \\
\hat{G}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\
&\;\;\vdots \\
\hat{G}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots
\end{align}

which we can also use to compute advantages:

\begin{align}
\hat{A}_t^{(1)} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\
&\;\;\vdots \\
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)
\end{align}

$\hat{A}_t^{(1)}$ is a purely TD estimate, and has low variance, but high bias. $\hat{A}_t^{(\infty)}$ is a purely MC estimate, and has zero bias, but high variance. If we choose an intermediate value of $k$ for $\hat{A}_t^{(k)}$, we can get an intermediate amount of bias and an intermediate amount of variance.
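The n-step advantage estimates can be computed from one sampled episode and a value estimate. The sketch below assumes a finite episode of length $T$, a list `values` of length $T + 1$ holding $V(s_0), \dots, V(s_T)$ with the terminal value set to 0, and that an $n$ reaching past the end of the episode falls back to the Monte-Carlo estimate. The function name and the numbers are illustrative.

```python
def n_step_advantage(rewards, values, t, n, gamma=1.0):
    """A_hat_t^(n) = r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n V(s_{t+n}) - V(s_t).

    rewards[k] is r_k and values[k] is V(s_k), with len(values) == len(rewards) + 1
    and values[-1] the (zero) value of the terminal state.
    """
    T = len(rewards)
    end = min(t + n, T)  # stop at the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    g += gamma ** (end - t) * values[end]  # bootstrap from the value estimate
    return g - values[t]


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]  # hypothetical V(s_0..s_3); terminal value is 0
for n in (1, 2, 10):
    print(n, n_step_advantage(rewards, values, t=0, n=n, gamma=0.9))
```

For $n = 1$ this is the one-step TD error, and for any $n$ that reaches the end of the episode it reduces to the Monte-Carlo return minus $V(s_t)$, matching the two extremes described above.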

References

[1] https://blog.openai.com/evolution-strategies/
[2] Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. http://www.cs.utexas.edu/ ai-lab/pubs/icra04.pdf

Emma Brunskill (CS234 Reinforcement Learning.) Lecture 8: Policy Gradient I 28 Winter 201