SLIDE 1

Maximum Entropy Inverse Reinforcement Learning

Algorithms for Imitation Learning

Maximilian Luz Summer Semester 2019

MLR/IPVS

SLIDE 2

Outline

  • Nomenclature
  • Basis: Feature Expectation Matching, Principle of Maximum Entropy
  • Maximum Entropy IRL: Algorithm and Derivation
  • Extensions
  • Demonstration

SLIDE 3

Nomenclature

SLIDE 4

Nomenclature (i)

Markov Decision Process (MDP)

  • States: S = {s_i}_i
  • Actions: A = {a_i}_i
  • Transition dynamics: T = p(s_{t+1} | s_t, a_t)
  • Reward: R : S → ℝ

Trajectories & Demonstrations

  • Trajectory: τ = ((s_1, a_1), (s_2, a_2), . . . , s_|τ|)
  • Demonstrations: D = {τ_i}_i

SLIDE 5

Nomenclature (ii)

Features

  • φ : S → ℝ^d with φ(τ) = Σ_{s_t ∈ τ} φ(s_t)

Policies

  • π(a_j | s_i) : Policy (stochastic)
  • π_L : Learner policy
  • π_E : Expert policy
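A minimal sketch of how this nomenclature might be laid out for a small tabular MDP (the array-based layout and all names are illustrative assumptions, not part of the slides):

    import numpy as np

    n_states, n_actions, d = 5, 2, 3                  # |S|, |A|, feature dimension

    # Transition dynamics T: T[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
    T = np.zeros((n_states, n_actions, n_states))

    # Reward R : S -> R, stored as one value per state
    R = np.zeros(n_states)

    # Features phi : S -> R^d, stored as an |S| x d matrix
    phi = np.zeros((n_states, d))

    # A trajectory as a sequence of visited state indices, demonstrations D as a list of trajectories
    tau = [0, 2, 3, 4]
    D = [tau]

    def phi_of_trajectory(tau, phi):
        """phi(tau) = sum_{s_t in tau} phi(s_t)."""
        return phi[tau].sum(axis=0)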

SLIDE 6

Basis

SLIDE 7

Feature Expectation Matching

Idea: Learner should visit same features as expert (in expectation).

Feature Expectation Matching [Abbeel and Ng 2004]

E_{π_E}[φ(τ)] = E_{π_L}[φ(τ)]

Note: We want to find a reward R : S → ℝ defining π_L(a | s) and thus p(τ), with

E_{π_L}[φ(τ)] = Σ_{τ ∈ T} p(τ) · φ(τ)

Observation: Optimality for linear (unknown) reward [Abbeel and Ng 2004].

⇒ R(s) = ω⊤φ(s), ω ∈ ℝ^d : Reward parameters
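As a rough illustration, the expert side of the matching constraint can be estimated directly from the demonstrations D; a minimal sketch, assuming trajectories are given as sequences of state indices and φ as an |S| × d matrix (names are illustrative):

    import numpy as np

    def empirical_feature_expectation(demonstrations, phi):
        """Estimate E_D[phi(tau)] as the average of phi(tau) = sum_{s_t in tau} phi(s_t)."""
        fe = np.zeros(phi.shape[1])
        for tau in demonstrations:
            fe += phi[tau].sum(axis=0)     # phi(tau) for one demonstrated trajectory
        return fe / len(demonstrations)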

SLIDE 8

Feature Expectation Matching: Problem

Problem: Multiple (infinite) solutions ⇒ ill-posed (Hadamard).

  • Reward shaping [Ng et al. 1999]:
  • Multiple reward functions R lead to same policy π.

Idea (Ziebart et al. 2008):

  • Regularize by maximizing entropy H(p).
  • But why?

SLIDE 9

Shannon’s Entropy

Entropy H(p)

H(p) = − Σ_{x ∈ X} p(x) · log₂ p(x)

  • x ∈ X : Event
  • p(x) : Probability of occurrence
  • − log₂ p(x) : Optimal encoding length

Expected information received when observing x ∈ X ⇒ Measure of uncertainty.

[Figure: two example distributions p(x). A point mass: no uncertainty, H(p) minimal. A uniform distribution: uniformly random, H(p) maximal.]
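For concreteness, a small sketch computing H(p) for the two cases in the figure (the function name is an illustrative assumption):

    import numpy as np

    def entropy(p):
        """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                           # convention: 0 * log 0 = 0
        return -np.sum(p * np.log2(p))

    entropy([1.0, 0.0, 0.0, 0.0])              # point mass     -> 0.0 bits (minimal)
    entropy([0.25, 0.25, 0.25, 0.25])          # uniform over 4 -> 2.0 bits (maximal)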

SLIDE 10

Principle of Maximum Entropy [Jaynes 1957]

Consider: A problem with solutions p, q, . . . (e.g. feature expectation matching)

⇒ p, q represent partial information.

[Figure: p, q; baseline (solutions); bias; information; entropy.]

⇒ Maximizing entropy minimizes bias.

SLIDE 11

Maximum Entropy IRL

SLIDE 12

Constrained Optimization Problem

Problem Formulation

arg max_p H(p)    (entropy)

subject to

  • E_{π_E}[φ(τ)] = E_{π_L}[φ(τ)]    (feature matching)
  • Σ_{τ ∈ T} p(τ) = 1 and ∀τ ∈ T : p(τ) > 0    (probability distribution)

SLIDE 13

Solution: Deterministic Dynamics

Solution via Lagrange multipliers [Ziebart et al. 2008]:

p(τ) ∝ exp(R(τ)) where R(τ) = ω⊤φ(τ)    (ω: Lagrange multipliers for feature matching)

Deterministic transition dynamics:

p(τ | ω) = (1 / Z(ω)) · exp(ω⊤φ(τ))    (normalization · reward)

with the partition function

Z(ω) = Σ_{τ ∈ T} exp(ω⊤φ(τ))
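For a toy problem in which the trajectory set T can be enumerated, this distribution is simply a softmax over trajectory rewards; a minimal sketch, assuming a |T| × d matrix of trajectory features (names are illustrative):

    import numpy as np

    def maxent_trajectory_distribution(traj_features, omega):
        """p(tau | omega) = exp(omega^T phi(tau)) / Z(omega) over an enumerated trajectory set."""
        r = traj_features @ omega          # R(tau) = omega^T phi(tau) for every trajectory
        r -= r.max()                       # shift for numerical stability (cancels in Z)
        e = np.exp(r)
        return e / e.sum()                 # divide by the partition function Z(omega)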

SLIDE 14

Solution: Stochastic Dynamics

Stochastic transition dynamics:

p(τ | ω) = (1 / Z_s(ω)) · exp(ω⊤φ(τ)) · Π_{(s_t, a_t, s_{t+1}) ∈ τ} p(s_{t+1} | s_t, a_t)

(the exp(ω⊤φ(τ)) factor is ∝ the deterministic solution; the product is the combined transition probability)

obtained via adaptation of the deterministic solution [Ziebart et al. 2008], assuming limited transition randomness.

Problem: The adaptation introduces a bias [Osa et al. 2018; Ziebart 2010]:

R̃(τ) = ω⊤φ(τ) + Σ_{(s_t, a_t, s_{t+1}) ∈ τ} log p(s_{t+1} | s_t, a_t)

Solution: Maximum Causal Entropy IRL (Ziebart 2010, not covered here).

SLIDE 15

Likelihood and Gradient

Obtain parameters by maximizing the likelihood:

ω* = arg max_ω L(ω) = arg max_ω Σ_{τ ∈ D} log p(τ | ω)

Observation:

  • Maximizing the likelihood is equivalent to minimizing the KL-divergence [Bishop 2006].
    ⇒ M-projection onto the manifold of maximum entropy distributions [Osa et al. 2018].
  • Convex, can be optimized via gradient ascent.

Gradient [Ziebart et al. 2008]:

∇L(ω) = E_D[φ(τ)] − Σ_{τ ∈ T} p(τ | ω) · φ(τ)
       = E_D[φ(τ)] − Σ_{s_i ∈ S} D_{s_i} · φ(s_i)

(E_D[φ(τ)]: "count" features in D; the sum over all τ ∈ T is computationally infeasible; D_{s_i}: state visitation frequency)
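Once E_D[φ(τ)] and the state visitation frequencies D_{s_i} are available, the second, tractable form of the gradient is a one-line computation; a minimal sketch (argument names are assumptions):

    import numpy as np

    def likelihood_gradient(fe_demos, svf, phi):
        """grad L(omega) = E_D[phi(tau)] - sum_{s_i} D_{s_i} * phi(s_i).

        fe_demos : (d,)      empirical feature expectations from the demonstrations
        svf      : (|S|,)    state visitation frequencies D_{s_i} under the current omega
        phi      : (|S|, d)  per-state features
        """
        return fe_demos - svf @ phi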

SLIDE 16

State Visitation Frequency

Observation:

π_ME(a_j | s_i, ω) ∝ Σ_{τ ∈ T : (s_i, a_j) ∈ τ_{t=1}} p(τ | ω)    (can be computed via R)

Idea: Split into sub-problems.

  • 1. Backward Pass: Compute policy π_ME(a | s, ω).
  • 2. Forward Pass: Compute state visitation frequency from π_ME(a | s, ω).

SLIDE 17

State Visitation Frequency: Backward Pass

Observation:

π_ME(a_j | s_i, ω) ∝ Σ_{τ ∈ T : (s_i, a_j) ∈ τ_{t=1}} p(τ | ω)

Idea:

π_ME(a_j | s_i, ω) = Z_{s_i, a_j} / Z_{s_i}    (Z_{s_i}: normalization)

Recursively expanding the observation:

Z_{s_i, a_j} = Σ_{s_k ∈ S} p(s_k | s_i, a_j) · exp(ω⊤φ(s_i)) · Z_{s_k},
Z_{s_i} = Σ_{a_j ∈ A} Z_{s_i, a_j}

Algorithm:

  • 1. Initialize Z_{s_k} = 1 for all terminal states s_k ∈ S_terminal.
  • 2. Compute Z_{s_i, a_j} and Z_{s_i} by recursively backing up from the terminal states.
  • 3. Compute π_ME(a_j | s_i, ω) = Z_{s_i, a_j} / Z_{s_i}.

Parallels to value-iteration.
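A minimal sketch of this backward pass for a finite, tabular MDP (the fixed iteration count and the array layout are assumptions; see Ziebart et al. 2008 for the exact algorithm):

    import numpy as np

    def backward_pass(T, reward, terminal, n_iters=100):
        """Compute pi_ME(a | s) = Z_{s,a} / Z_s by backing up from the terminal states.

        T        : (|S|, |A|, |S|)  transition probabilities p(s_k | s_i, a_j)
        reward   : (|S|,)           per-state reward R(s) = omega^T phi(s)
        terminal : list of terminal state indices
        """
        n_states, n_actions, _ = T.shape
        z_s = np.zeros(n_states)
        z_s[terminal] = 1.0                        # 1. initialize Z_{s_k} = 1 for terminal states

        for _ in range(n_iters):                   # 2. recursive backup (value-iteration-like)
            # Z_{s_i,a_j} = sum_{s_k} p(s_k | s_i, a_j) * exp(R(s_i)) * Z_{s_k}
            z_sa = np.exp(reward)[:, None] * (T @ z_s)
            z_s = z_sa.sum(axis=1)
            z_s[terminal] += 1.0                   # terminal states retain their contribution of 1

        return z_sa / np.maximum(z_s[:, None], 1e-300)   # 3. pi_ME(a_j | s_i) = Z_{s_i,a_j} / Z_{s_i}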

SLIDE 18

State Visitation Frequency: Forward Pass

Idea: Propagate starting-state probabilities p_0(s) forward via the policy π_ME(a | s, ω).

Algorithm:

  • 1. Initialize D_{s, 0} = p_0(s) = p(τ ∈ T : s ∈ τ_{t=1}).
  • 2. Recursively compute
        D_{s_k, t+1} = Σ_{s_i ∈ S} Σ_{a_j ∈ A} D_{s_i, t} · π_ME(a_j | s_i) · p(s_k | s_i, a_j)
  • 3. Sum up over t, i.e. D_{s_i} = Σ_{t=0,1,...} D_{s_i, t}.
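A matching sketch of the forward pass under the same tabular assumptions, truncated to a fixed horizon rather than an infinite sum:

    import numpy as np

    def forward_pass(T, policy, p0, n_iters=100):
        """Expected state visitation frequencies D_{s_i} under pi_ME.

        T      : (|S|, |A|, |S|)  transition probabilities p(s_k | s_i, a_j)
        policy : (|S|, |A|)       pi_ME(a_j | s_i)
        p0     : (|S|,)           starting-state distribution p_0(s)
        """
        d_t = p0.copy()                            # 1. D_{s,0} = p_0(s)
        svf = p0.copy()
        for _ in range(n_iters):                   # 2. propagate forward through policy and dynamics
            # D_{s_k,t+1} = sum_{s_i,a_j} D_{s_i,t} * pi_ME(a_j | s_i) * p(s_k | s_i, a_j)
            d_t = np.einsum('i,ij,ijk->k', d_t, policy, T)
            svf += d_t                             # 3. D_{s_i} = sum_t D_{s_i,t}
        return svf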

SLIDE 19

Summary

Algorithm: Iterate until convergence:

  • 1. Compute policy π_ME(a | s, ω) (backward pass).
  • 2. Compute state visitation frequency D_{s_i} (forward pass).
  • 3. Compute gradient ∇L(ω) of likelihood.
  • 4. Gradient-based optimization step, e.g.: ω ← ω + η∇L(ω).
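Putting the pieces together, a minimal sketch of this loop, reusing the backward_pass and forward_pass sketches from the previous slides (learning rate, epoch count, and the fixed-iteration stopping rule are illustrative assumptions):

    import numpy as np

    def maxent_irl(T, phi, p0, terminal, demos, lr=0.1, n_epochs=200):
        """Gradient ascent on the MaxEnt IRL likelihood for a tabular MDP."""
        # "count" features in the demonstrations: E_D[phi(tau)]
        fe_demos = np.mean([phi[tau].sum(axis=0) for tau in demos], axis=0)
        omega = np.zeros(phi.shape[1])

        for _ in range(n_epochs):
            reward = phi @ omega                           # R(s) = omega^T phi(s)
            policy = backward_pass(T, reward, terminal)    # 1. policy pi_ME(a | s, omega)
            svf = forward_pass(T, policy, p0)              # 2. state visitation frequencies D_{s_i}
            grad = fe_demos - svf @ phi                    # 3. gradient of the log-likelihood
            omega += lr * grad                             # 4. gradient ascent step
        return omega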

Assumptions:

  • Known transition dynamics T = p(s_{t+1} | s_t, a_t).
  • Limited transition randomness.
  • Linear reward R(s) = ω⊤φ(s).

Other Drawbacks:

  • Need to “solve” MDP once per iteration.
  • Reward bias for stochastic transition dynamics.

SLIDE 20

Extensions

  • Maximum Causal Entropy IRL [Ziebart 2010]
  • Maximum Entropy Deep IRL [Wulfmeier et al. 2015]
  • Maximum Entropy IRL in Continuous State Spaces with Path Integrals [Aghasadeghi and Bretl 2011]

SLIDE 21

Demonstration

github.com/qzed/irl-maxent

SLIDE 22

References

Abbeel, Pieter and Andrew Y. Ng (2004). "Apprenticeship Learning via Inverse Reinforcement Learning". In: Proc. 21st Intl. Conference on Machine Learning (ICML '04).

Aghasadeghi, N. and T. Bretl (Sept. 2011). "Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals". In: Intl. Conference on Intelligent Robots and Systems (IROS 2011), pp. 1561–1566.

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York Inc.

Jaynes, E. T. (May 1957). "Information Theory and Statistical Mechanics". In: Physical Review 106.4, pp. 620–630.

Ng, Andrew Y., Daishi Harada, and Stuart J. Russell (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping". In: Proc. 16th Intl. Conference on Machine Learning (ICML '99), pp. 278–287.

Osa, Takayuki et al. (2018). "An Algorithmic Perspective on Imitation Learning". In: Foundations and Trends in Robotics 7.1-2, pp. 1–179.

Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner (2015). "Deep Inverse Reinforcement Learning". In: Computing Research Repository. arXiv: 1507.04888.

Ziebart, Brian D. (2010). "Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy". PhD thesis. Carnegie Mellon University.

Ziebart, Brian D. et al. (2008). "Maximum Entropy Inverse Reinforcement Learning". In: Proc. 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pp. 1433–1438.