Online Exploration in Least-Squares Policy Iteration

Lihong Li, Michael L. Littman, and Christopher R. Mansley

Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)

AAMAS 2009, Budapest, May 14, 2009

Contributions

Reinforcement learning poses two challenges:

  • Challenge I: the exploration/exploitation tradeoff, addressed by Rmax [Brafman & Tennenholtz 02] (provably efficient, but limited to finite MDPs)
  • Challenge II: value-function approximation, addressed by LSPI [Lagoudakis & Parr 03] (handles continuous state spaces, but is an offline method)

This work combines the two into LSPI-Rmax.

Outline

  • Introduction

– LSPI
– Rmax

  • LSPI-Rmax
  • Experiments
  • Conclusions

Basic Terminology

  • Markov decision process

– States: S
– Actions: A
– Reward function: −1 ≤ R(s,a) ≤ 1
– Transition probabilities: T(s′|s,a)
– Discount factor: 0 < γ < 1

  • Optimal value function: Q*(s,a) = R(s,a) + γ Σs′ T(s′|s,a) maxa′ Q*(s′,a′)
  • Optimal policy: π*(s) = argmaxa Q*(s,a)
  • Approximate Q* when S is too large or continuous
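To make these definitions concrete, here is a tiny value-iteration sketch in Python; the three-state MDP and every number in it are illustrative assumptions, not taken from the talk:

```python
import numpy as np

# Toy finite MDP; all numbers here are illustrative, not from the talk.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, (n_s, n_a))            # -1 <= R(s,a) <= 1
T = rng.dirichlet(np.ones(n_s), (n_s, n_a))   # T[s, a] is a distribution over s'

# Value iteration on the Bellman optimality equation above.
Q = np.zeros((n_s, n_a))
for _ in range(1000):
    Q = R + gamma * T @ Q.max(axis=1)         # Q*(s,a) = R + γ Σ T max Q*
pi_star = Q.argmax(axis=1)                    # π*(s) = argmax_a Q*(s,a)
```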

Linear Function Approximation

  • Features: φ(s,a) = (φ1(s,a), …, φk(s,a))

– a.k.a. “basis functions”; assumed predefined

  • Weights: w = (w1, …, wk)

– measure the contribution of each φi to approximating Q*

  • Approximation: Q̂(s,a) = w·φ(s,a) = Σi wi φi(s,a)
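A sketch of what such a linear approximator looks like in code, using generic RBF features (one of the generic feature families mentioned in the backup slides); the block-per-action layout, centers, and width are illustrative assumptions:

```python
import numpy as np

def make_rbf_features(centers, width, n_actions):
    """One Gaussian bump per center, replicated in a separate block per action."""
    centers = np.asarray(centers, dtype=float)
    k = len(centers)
    def phi(s, a):
        f = np.zeros(k * n_actions)
        f[a * k:(a + 1) * k] = np.exp(-((centers - s) ** 2) / width)
        return f
    return phi

phi = make_rbf_features(np.linspace(0.0, 1.0, 5), width=0.1, n_actions=2)
w = np.zeros(10)             # one weight per feature phi_i
q_hat = w @ phi(0.3, 1)      # Q^(s,a) = w · φ(s,a) = Σ_i w_i φ_i(s,a)
```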

LSPI [Lagoudakis & Parr 03]

1. Initialize π
2. Evaluate π: compute w (via LSTDQ on a sample set D)
3. Improve π: π′(s) = argmaxa w·φ(s,a)
4. π ← π′; go to 2
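In code, the evaluation step solves a k×k linear system assembled from the sample set D. The sketch below follows the LSTDQ normal equations A w = b of Lagoudakis & Parr, but the small ridge term `reg` and the `phi`/`policy` interfaces are assumptions added for illustration:

```python
import numpy as np

def lstdq(D, phi, policy, gamma=0.95, reg=1e-6):
    """Evaluate `policy` from samples D = [(s, a, r, s_next), ...]:
    solve A w = b with A = Σ φ(φ - γ φ')ᵀ and b = Σ r φ."""
    k = len(phi(*D[0][:2]))
    A = reg * np.eye(k)      # ridge term keeps A invertible (an assumption)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += r * f
    return np.linalg.solve(A, b)

def lspi(D, phi, actions, gamma=0.95, n_iter=20):
    """Alternate evaluation and greedy improvement on a fixed sample set D."""
    w = np.zeros(len(phi(*D[0][:2])))
    for _ in range(n_iter):
        greedy = lambda s, w=w: max(actions, key=lambda a: w @ phi(s, a))
        w = lstdq(D, phi, greedy, gamma)
    return w
```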

But LSPI does not specify how the sample set D is collected, a fundamental challenge in online reinforcement learning: an agent only collects samples in the states it actually visits.

Exploration/Exploitation Tradeoff

[Figure: total reward vs. time for an efficient and an inefficient explorer; the efficient explorer approaches the optimal policy's reward much sooner.]

Rmax [Brafman & Tennenholtz 02]

  • Rmax is for finite-state, finite-action MDPs
  • Learns T and R by counting/averaging
  • Partitions the state-action space S × A into known and unknown state-actions
  • In state st, takes the optimal action under an optimistic model, so it either:

– explores the “unknown” region, or
– exploits the “known” region

  • “Optimism in the face of uncertainty”

Theorem: Rmax is provably efficient.
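A minimal sketch of Rmax's bookkeeping, assuming a knownness threshold m; the class and method names are illustrative, not the original implementation:

```python
from collections import defaultdict

class RmaxCounts:
    """Tabular Rmax bookkeeping: count visits, average rewards,
    and flag (s, a) as known once it has m samples."""
    def __init__(self, m=10, r_max=1.0, gamma=0.95):
        self.m, self.q_max = m, r_max / (1.0 - gamma)
        self.n = defaultdict(int)            # visit counts n(s, a)
        self.r_sum = defaultdict(float)      # accumulated rewards
        self.next_counts = defaultdict(int)  # counts n(s, a, s')

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.r_sum[(s, a)] += r
        self.next_counts[(s, a, s_next)] += 1

    def known(self, s, a):
        return self.n[(s, a)] >= self.m

    def r_hat(self, s, a):          # empirical mean reward of a known pair
        return self.r_sum[(s, a)] / self.n[(s, a)]

    def t_hat(self, s, a, s_next):  # empirical transition probability
        return self.next_counts[(s, a, s_next)] / self.n[(s, a)]
```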

LSPI-Rmax

  • Similar to LSPI
  • But distinguishes known and unknown (s,a):

– the samples in D determine which parts of S × A count as known
– unknown state-actions have their Q-values treated as Qmax (like Rmax)

  • Requires modifications of LSTDQ

LSTDQ-Rmax
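The idea from the previous slide suggests one natural modification of LSTDQ: when the next state-action is unknown, back up γ·Qmax instead of bootstrapping through w·φ(s′,a′). The sketch below implements that reading; the `known` predicate (some knownness criterion built from D) and the ridge term are assumptions, not necessarily the paper's exact equations:

```python
import numpy as np

def lstdq_rmax(D, phi, policy, known, gamma=0.95, q_max=None, reg=1e-6):
    """LSTDQ with Rmax-style optimism (a sketch, not the paper's exact equations).
    known(s, a) -> bool decides whether (s, a) has enough support in D."""
    if q_max is None:
        q_max = 1.0 / (1.0 - gamma)   # Rmax / (1 - γ) with |R| <= 1
    k = len(phi(*D[0][:2]))
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        a_next = policy(s_next)
        if known(s_next, a_next):
            # ordinary LSTDQ backup through the next state-action
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += r * f
        else:
            # optimistic backup: the unknown (s', a') is valued at q_max
            A += np.outer(f, f)
            b += (r + gamma * q_max) * f
    return np.linalg.solve(A, b)
```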

LSPI-Rmax for Online RL

  • D = empty set
  • Initialize w
  • for t = 1, 2, 3, …

– Take greedy action: at = argmaxa w·φ(st,a)
– D = D ∪ {(st, at, rt, st+1)}
– Run LSPI using LSTDQ-Rmax
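Putting the loop in code, reusing `lstdq_rmax` from the sketch above; the `env` step/reset interface and re-running a few LSPI sweeps after every sample are illustrative assumptions (in practice one would likely re-plan less often):

```python
import numpy as np

def lspi_rmax_online(env, phi, actions, known, gamma=0.95, n_steps=10000):
    """Online LSPI-Rmax sketch: act greedily, store the sample, re-plan."""
    s = env.reset()
    D, w = [], np.zeros(len(phi(s, actions[0])))
    for t in range(n_steps):
        a = max(actions, key=lambda a: w @ phi(s, a))   # greedy action
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next))
        for _ in range(5):                              # a few LSPI sweeps
            greedy = lambda x, w=w: max(actions, key=lambda a: w @ phi(x, a))
            w = lstdq_rmax(D, phi, greedy, known, gamma)
        s = env.reset() if done else s_next
    return w
```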

Experiments

  • Problems

– MountainCar
– Bicycle
– Continuous Combination Lock
– ExpressWorld (a variant of PuddleWorld)

  • ExpressWorld:

– four actions, stochastic transitions, random start states
– reward of −1 per step (−0.5 per step in the “express lane”)
– penalty for stepping into puddles

Various Exploration Rules with LSPI

[Learning-curve figure: compared with the other exploration rules, LSPI-Rmax converges to better policies.]

A Closer Look

States visited in the first 3 episodes:

[Figure: inefficient vs. efficient exploration; efficient exploration helps the agent discover the goal and the express lane.]

More Experiments

Effect of Rmax Threshold

Conclusions

  • We proposed LSPI-Rmax

– LSPI + Rmax
– encourages active exploration
– with linear function approximation

  • Future directions

– Similar idea applied to Gaussian process RL
– Comparison to model-based RL


Where are features from?

  • Hand-crafted features

– expert knowledge required
– expensive and error-prone

  • Generic features

– RBF, CMAC, polynomial, etc.
– may not always work well

  • Automatic feature selection using

– Bellman error [Parr et al. 07]
– spectral graph analysis [Mahadevan & Maggioni 07]
– TD approximation [Li, Williams & Balakrishnan 09]
– L1 regularization for LSPI [Kolter & Ng 09]

LSPI-Rmax vs. MBRL

  • Model-based RL (e.g., Rmax)

– Learns an MDP model
– Computes a policy with the approximate model
– Can use function approximation in model learning, e.g., Rmax with many compact representations [Li 09]
  • LSPI-Rmax is model-free RL

– Avoids the expensive “planning” step
– Has weaker theoretical guarantees