Online Exploration in Least-Squares Policy Iteration

Lihong Li, Michael L. Littman, and Christopher R. Mansley

Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)

AAMAS 2009, Budapest, May 14, 2009

Contributions

Reinforcement learning poses two challenges:

  • Challenge I: the exploration/exploitation tradeoff, addressed by Rmax [Brafman & Tennenholtz 02] (provably efficient, but limited to finite MDPs)
  • Challenge II: value-function approximation, addressed by LSPI [Lagoudakis & Parr 03] (handles continuous state spaces, but is an offline method)

This work combines the two into LSPI-Rmax.

Outline

  • Introduction

– LSPI
– Rmax

  • LSPI-Rmax
  • Experiments
  • Conclusions

Basic Terminology

  • Markov decision process

– States: S
– Actions: A
– Reward function: −1 ≤ R(s,a) ≤ 1
– Transition probabilities: T(s′|s,a)
– Discount factor: 0 < γ < 1

  • Optimal value function: Q*(s,a) = R(s,a) + γ Σs′ T(s′|s,a) maxa′ Q*(s′,a′)
  • Optimal policy: π*(s) = argmaxa Q*(s,a)
  • Approximate Q* when S is too large or continuous
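To make these definitions concrete, here is a tiny value-iteration sketch in Python; the three-state MDP and every number in it are illustrative assumptions, not taken from the talk:

```python
import numpy as np

# Toy finite MDP; all numbers here are illustrative, not from the talk.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, (n_s, n_a))            # -1 <= R(s,a) <= 1
T = rng.dirichlet(np.ones(n_s), (n_s, n_a))   # T[s, a] is a distribution over s'

# Value iteration on the Bellman optimality equation above.
Q = np.zeros((n_s, n_a))
for _ in range(1000):
    Q = R + gamma * T @ Q.max(axis=1)         # Q*(s,a) = R + γ Σ T max Q*
pi_star = Q.argmax(axis=1)                    # π*(s) = argmax_a Q*(s,a)
```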

Linear Function Approximation

  • Features: φ(s,a) = (φ1(s,a), …, φk(s,a))

– a.k.a. “basis functions”; assumed predefined

  • Weights: w = (w1, …, wk)

– measure the contribution of each φi to approximating Q*

  • Approximation: Q̂(s,a) = w·φ(s,a) = Σi wi φi(s,a)
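A sketch of what such a linear approximator looks like in code, using generic RBF features (one of the generic feature families mentioned in the backup slides); the block-per-action layout, centers, and width are illustrative assumptions:

```python
import numpy as np

def make_rbf_features(centers, width, n_actions):
    """One Gaussian bump per center, replicated in a separate block per action."""
    centers = np.asarray(centers, dtype=float)
    k = len(centers)
    def phi(s, a):
        f = np.zeros(k * n_actions)
        f[a * k:(a + 1) * k] = np.exp(-((centers - s) ** 2) / width)
        return f
    return phi

phi = make_rbf_features(np.linspace(0.0, 1.0, 5), width=0.1, n_actions=2)
w = np.zeros(10)             # one weight per feature phi_i
q_hat = w @ phi(0.3, 1)      # Q^(s,a) = w · φ(s,a) = Σ_i w_i φ_i(s,a)
```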

LSPI [Lagoudakis & Parr 03]

1. Initialize π
2. Evaluate π: compute w (via LSTDQ on a sample set D)
3. Improve π: π′(s) = argmaxa w·φ(s,a)
4. π ← π′; go to 2
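In code, the evaluation step solves a k×k linear system assembled from the sample set D. The sketch below follows the LSTDQ normal equations A w = b of Lagoudakis & Parr, but the small ridge term `reg` and the `phi`/`policy` interfaces are assumptions added for illustration:

```python
import numpy as np

def lstdq(D, phi, policy, gamma=0.95, reg=1e-6):
    """Evaluate `policy` from samples D = [(s, a, r, s_next), ...]:
    solve A w = b with A = Σ φ(φ - γ φ')ᵀ and b = Σ r φ."""
    k = len(phi(*D[0][:2]))
    A = reg * np.eye(k)      # ridge term keeps A invertible (an assumption)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += r * f
    return np.linalg.solve(A, b)

def lspi(D, phi, actions, gamma=0.95, n_iter=20):
    """Alternate evaluation and greedy improvement on a fixed sample set D."""
    w = np.zeros(len(phi(*D[0][:2])))
    for _ in range(n_iter):
        greedy = lambda s, w=w: max(actions, key=lambda a: w @ phi(s, a))
        w = lstdq(D, phi, greedy, gamma)
    return w
```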

But LSPI does not specify how the sample set D is collected, a fundamental challenge in online reinforcement learning: an agent only collects samples in the states it actually visits.

Exploration/Exploitation Tradeoff

[Figure: total reward vs. time for an efficient and an inefficient explorer; the efficient explorer approaches the optimal policy's reward much sooner.]

Rmax [Brafman & Tennenholtz 02]

  • Rmax is for finite-state, finite-action MDPs
  • Learns T and R by counting/averaging
  • Partitions the state-action space S × A into known and unknown state-actions
  • In state st, takes the optimal action under an optimistic model, so it either:

– explores the “unknown” region, or
– exploits the “known” region

  • “Optimism in the face of uncertainty”

Theorem: Rmax is provably efficient.
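A minimal sketch of Rmax's bookkeeping, assuming a knownness threshold m; the class and method names are illustrative, not the original implementation:

```python
from collections import defaultdict

class RmaxCounts:
    """Tabular Rmax bookkeeping: count visits, average rewards,
    and flag (s, a) as known once it has m samples."""
    def __init__(self, m=10, r_max=1.0, gamma=0.95):
        self.m, self.q_max = m, r_max / (1.0 - gamma)
        self.n = defaultdict(int)            # visit counts n(s, a)
        self.r_sum = defaultdict(float)      # accumulated rewards
        self.next_counts = defaultdict(int)  # counts n(s, a, s')

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.r_sum[(s, a)] += r
        self.next_counts[(s, a, s_next)] += 1

    def known(self, s, a):
        return self.n[(s, a)] >= self.m

    def r_hat(self, s, a):          # empirical mean reward of a known pair
        return self.r_sum[(s, a)] / self.n[(s, a)]

    def t_hat(self, s, a, s_next):  # empirical transition probability
        return self.next_counts[(s, a, s_next)] / self.n[(s, a)]
```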

LSPI-Rmax

  • Similar to LSPI
  • But distinguishes known and unknown (s,a):

– the samples in D determine which parts of S × A count as known
– unknown state-actions have their Q-values treated as Qmax (like Rmax)

  • Requires modifications of LSTDQ

LSTDQ-Rmax
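The idea from the previous slide suggests one natural modification of LSTDQ: when the next state-action is unknown, back up γ·Qmax instead of bootstrapping through w·φ(s′,a′). The sketch below implements that reading; the `known` predicate (some knownness criterion built from D) and the ridge term are assumptions, not necessarily the paper's exact equations:

```python
import numpy as np

def lstdq_rmax(D, phi, policy, known, gamma=0.95, q_max=None, reg=1e-6):
    """LSTDQ with Rmax-style optimism (a sketch, not the paper's exact equations).
    known(s, a) -> bool decides whether (s, a) has enough support in D."""
    if q_max is None:
        q_max = 1.0 / (1.0 - gamma)   # Rmax / (1 - γ) with |R| <= 1
    k = len(phi(*D[0][:2]))
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        a_next = policy(s_next)
        if known(s_next, a_next):
            # ordinary LSTDQ backup through the next state-action
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += r * f
        else:
            # optimistic backup: the unknown (s', a') is valued at q_max
            A += np.outer(f, f)
            b += (r + gamma * q_max) * f
    return np.linalg.solve(A, b)
```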

LSPI-Rmax for Online RL

  • D = empty set
  • Initialize w
  • for t = 1, 2, 3, …

– Take greedy action: at = argmaxa w·φ(st,a)
– D = D ∪ {(st, at, rt, st+1)}
– Run LSPI using LSTDQ-Rmax
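Putting the loop in code, reusing `lstdq_rmax` from the sketch above; the `env` step/reset interface and re-running a few LSPI sweeps after every sample are illustrative assumptions (in practice one would likely re-plan less often):

```python
import numpy as np

def lspi_rmax_online(env, phi, actions, known, gamma=0.95, n_steps=10000):
    """Online LSPI-Rmax sketch: act greedily, store the sample, re-plan."""
    s = env.reset()
    D, w = [], np.zeros(len(phi(s, actions[0])))
    for t in range(n_steps):
        a = max(actions, key=lambda a: w @ phi(s, a))   # greedy action
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next))
        for _ in range(5):                              # a few LSPI sweeps
            greedy = lambda x, w=w: max(actions, key=lambda a: w @ phi(x, a))
            w = lstdq_rmax(D, phi, greedy, known, gamma)
        s = env.reset() if done else s_next
    return w
```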

Experiments

  • Problems

– MountainCar
– Bicycle
– Continuous Combination Lock
– ExpressWorld (a variant of PuddleWorld)

  • ExpressWorld:

– four actions, stochastic transitions, random start states
– reward of −1 per step (−0.5 per step in the “express lane”)
– penalty for stepping into puddles

Various Exploration Rules with LSPI

[Learning-curve figure: compared with the other exploration rules, LSPI-Rmax converges to better policies.]

A Closer Look

States visited in the first 3 episodes:

[Figure: inefficient vs. efficient exploration; efficient exploration helps the agent discover the goal and the express lane.]

More Experiments

Effect of Rmax Threshold

Conclusions

  • We proposed LSPI-Rmax

– LSPI + Rmax
– encourages active exploration
– with linear function approximation

  • Future directions

– Similar idea applied to Gaussian process RL
– Comparison to model-based RL


Where are features from?

  • Hand-crafted features

– expert knowledge required
– expensive and error-prone

  • Generic features

– RBF, CMAC, polynomial, etc.
– may not always work well

  • Automatic feature selection using

– Bellman error [Parr et al. 07]
– spectral graph analysis [Mahadevan & Maggioni 07]
– TD approximation [Li, Williams & Balakrishnan 09]
– L1 regularization for LSPI [Kolter & Ng 09]

LSPI-Rmax vs. MBRL

  • Model-based RL (e.g., Rmax)

– Learns an MDP model
– Computes a policy with the approximate model
– Can use function approximation in model learning, e.g., Rmax with many compact representations [Li 09]
  • LSPI-Rmax is model-free RL

– Avoids the expensive “planning” step
– Has weaker theoretical guarantees