SLIDE 1
Maximum Entropy Inverse Reinforcement Learning
Maximilian Luz, Algorithms for Imitation Learning, Summer Semester 2019, MLR/IPVS
Outline: Nomenclature, Basis (Feature Expectation Matching, Principle of Maximum Entropy), Maximum Entropy IRL, Algorithm and Extensions
SLIDE 2
SLIDE 3
Nomenclature
SLIDE 4
Nomenclature (i)
Markov Decision Process (MDP)
- S = {s_i}_i : States
- A = {a_i}_i : Actions
- T = p(s_{t+1} | s_t, a_t) : Transition dynamics
- R : S → ℝ : Reward

Trajectories and Demonstrations
- τ = ((s_1, a_1), (s_2, a_2), ..., s_|τ|) : Trajectory
- D = {τ_i}_i : Demonstrations
SLIDE 5
Nomenclature (ii)
Features
- φ : S → ℝ^d, with φ(τ) = Σ_{s_t ∈ τ} φ(s_t)

Policies
- π(a_j | s_i) : Policy (stochastic)
- π_L : Learner policy
- π_E : Expert policy
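The trajectory feature count φ(τ) defined above is just the sum of the per-state features along the trajectory. A minimal sketch in NumPy, assuming states are integer indices and the feature map is given as a matrix (setup and names are illustrative only):

```python
import numpy as np

# Illustrative setup: 5 states with one-hot features, so phi(tau)
# simply counts how often each state is visited.
features = np.eye(5)                  # features[s] = phi(s), phi : S -> R^d, d = 5

def feature_count(states, features):
    """phi(tau) = sum_{s_t in tau} phi(s_t) for a trajectory given by its state sequence."""
    return features[list(states)].sum(axis=0)

tau = [0, 1, 1, 3, 4]                 # state sequence of one trajectory
print(feature_count(tau, features))   # -> [1. 2. 0. 1. 1.]
```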
SLIDE 6
Basis
SLIDE 7
Feature Expectation Matching
Idea: The learner should visit the same features as the expert (in expectation).

Feature Expectation Matching [Abbeel and Ng 2004]:

E_{π_E}[φ(τ)] = E_{π_L}[φ(τ)]

Note: We want to find a reward R : S → ℝ defining π_L(a | s) and thus p(τ), with

E_{π_L}[φ(τ)] = Σ_{τ∈T} p(τ) · φ(τ).

Observation: Matching feature expectations is sufficient for optimality under a linear (unknown) reward [Abbeel and Ng 2004]:

⇒ R(s) = ω⊤φ(s), with ω ∈ ℝ^d the reward parameters.
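A sketch of the two quantities used here, the expert feature expectation estimated from demonstrations and the linear reward R(s) = ω⊤φ(s); all names and data are illustrative:

```python
import numpy as np

def expert_feature_expectation(demonstrations, features):
    """Estimate E_{pi_E}[phi(tau)] as the mean feature count over the demonstrated trajectories."""
    counts = [features[list(tau)].sum(axis=0) for tau in demonstrations]
    return np.mean(counts, axis=0)

def linear_reward(omega, features):
    """Per-state reward R(s) = omega^T phi(s), evaluated for all states at once."""
    return features @ omega

# Illustrative usage with one-hot features and two short demonstrations.
features = np.eye(5)
demos = [[0, 1, 2, 4], [0, 1, 3, 4]]
omega = np.array([0.0, 0.5, 0.1, 0.1, 1.0])
print(expert_feature_expectation(demos, features))   # -> [1.  1.  0.5 0.5 1. ]
print(linear_reward(omega, features))                # -> one reward value per state
```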
SLIDE 8
Feature Expectation Matching: Problem
Problem: Infinitely many solutions ⇒ the problem is ill-posed (in the sense of Hadamard).
- Reward shaping [Ng et al. 1999]: multiple reward functions R lead to the same policy π.

Idea [Ziebart et al. 2008]:
- Regularize by maximizing the entropy H(p).
- But why?
SLIDE 9
Shannon’s Entropy
Entropy:

H(p) = − Σ_{x∈X} p(x) log₂ p(x)

- x ∈ X : Event
- p(x) : Probability of occurrence
- − log₂ p(x) : Optimal encoding length

H(p) is the expected information received when observing x ∈ X ⇒ a measure of uncertainty.
[Figure: two distributions p(x). Left: all mass on a single event, no uncertainty, H(p) minimal. Right: uniformly random, H(p) maximal.]
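A quick numerical check of both extremes, as a minimal sketch:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x), skipping zero-probability events."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0 -> no uncertainty, minimal entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 -> uniform, maximal entropy (log2 4)
```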
SLIDE 10
Principle of Maximum Entropy [Jaynes 1957]
Consider: A problem with multiple solutions p, q, ... (e.g. feature expectation matching).
⇒ Each solution represents the same partial information, but may add different bias on top of it.

[Figure: solutions p and q compared by information content; the information required by the constraints is the baseline, anything beyond it is bias, and the bias shrinks as the entropy grows.]

⇒ Maximizing entropy minimizes the bias added beyond the given information.
SLIDE 11
Maximum Entropy IRL
SLIDE 12
Constrained Optimization Problem
Problem Formulation:

arg max_p H(p)                                      (entropy)
subject to   E_{π_E}[φ(τ)] = E_{π_L}[φ(τ)],          (feature matching)
             Σ_{τ∈T} p(τ) = 1,  ∀τ ∈ T : p(τ) > 0    (probability distribution)
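The next slide states the solution directly; as a sketch of the intermediate step (the standard maximum-entropy derivation, here using the natural logarithm, which only rescales H(p) and so does not change the maximizer), the Lagrangian and its stationarity condition already yield the exponential form, with ω playing the role of the Lagrange multipliers for feature matching:

```latex
\mathcal{L}(p, \omega, \lambda)
  = -\sum_{\tau \in \mathcal{T}} p(\tau) \ln p(\tau)
    + \omega^\top \Big( \sum_{\tau \in \mathcal{T}} p(\tau)\, \phi(\tau)
        - \mathbb{E}_{\pi_E}[\phi(\tau)] \Big)
    + \lambda \Big( \sum_{\tau \in \mathcal{T}} p(\tau) - 1 \Big)

\frac{\partial \mathcal{L}}{\partial p(\tau)}
  = -\ln p(\tau) - 1 + \omega^\top \phi(\tau) + \lambda
  \overset{!}{=} 0
  \quad \Longrightarrow \quad
  p(\tau) \propto \exp\!\big( \omega^\top \phi(\tau) \big)
```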
SLIDE 13
Solution: Deterministic Dynamics
Solution via Lagrange multipliers [Ziebart et al. 2008]:

p(τ) ∝ exp(R(τ)),   where R(τ) = ω⊤φ(τ) and ω are the Lagrange multipliers for feature matching.

Deterministic transition dynamics:

p(τ | ω) = (1 / Z(ω)) · exp(ω⊤φ(τ))

where 1/Z(ω) is the normalization, exp(ω⊤φ(τ)) the exponentiated reward, and

Z(ω) = Σ_{τ∈T} exp(ω⊤φ(τ))

is the partition function.
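For a small, enumerable trajectory set T this distribution is simply a softmax over trajectory rewards; a minimal sketch (the trajectory set, features, and weights are illustrative):

```python
import numpy as np

def trajectory_distribution(trajectories, features, omega):
    """p(tau | omega) = exp(omega^T phi(tau)) / Z(omega), deterministic dynamics."""
    rewards = np.array([omega @ features[list(tau)].sum(axis=0) for tau in trajectories])
    rewards -= rewards.max()              # numerical stability, cancels in the ratio
    unnormalized = np.exp(rewards)
    return unnormalized / unnormalized.sum()

features = np.eye(5)
trajectories = [[0, 1, 2, 4], [0, 1, 3, 4], [0, 3, 3, 4]]
omega = np.array([0.0, 1.0, 0.0, 0.5, 1.0])
print(trajectory_distribution(trajectories, features, omega))  # sums to 1
```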
SLIDE 14
Solution: Stochastic Dynamics
Stochastic transition dynamics:

p(τ | ω) = (1 / Z_s(ω)) · exp(ω⊤φ(τ)) · Π_{(s_t, a_t, s_{t+1}) ∈ τ} p(s_{t+1} | s_t, a_t)

i.e. the deterministic solution multiplied by the combined transition probability of the trajectory, obtained by adapting the deterministic solution [Ziebart et al. 2008]. This assumes limited transition randomness.

Problem: The adaptation introduces a bias [Osa et al. 2018; Ziebart 2010], since the resulting distribution effectively corresponds to the reward

R̃(τ) = ω⊤φ(τ) + Σ_{(s_t, a_t, s_{t+1}) ∈ τ} log p(s_{t+1} | s_t, a_t).

Solution: Maximum Causal Entropy IRL [Ziebart 2010] (not covered here).
SLIDE 15
Likelihood and Gradient
Obtain the parameters by maximizing the likelihood:

ω* = arg max_ω L(ω) = arg max_ω Σ_{τ∈D} log p(τ | ω)

Observations:
- Maximizing the likelihood is equivalent to minimizing a KL-divergence [Bishop 2006]
  ⇒ an M-projection onto the manifold of maximum entropy distributions [Osa et al. 2018].
- The objective is convex and can be optimized via gradient ascent.

Gradient [Ziebart et al. 2008]:

∇L(ω) = E_D[φ(τ)] − Σ_{τ∈T} p(τ | ω) φ(τ)
      = E_D[φ(τ)] − Σ_{s_i∈S} D_{s_i} φ(s_i)

- E_D[φ(τ)] : "count" features in the demonstrations D
- Σ_{τ∈T} p(τ | ω) φ(τ) : computationally infeasible (sum over all trajectories)
- D_{s_i} : state visitation frequency
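A sketch of the gradient in its tractable form, assuming the state visitation frequencies D_{s_i} (computed on the following slides) are already available; all names are illustrative:

```python
import numpy as np

def likelihood_gradient(demonstrations, features, visitation):
    """grad L(omega) = E_D[phi(tau)] - sum_s D_s * phi(s).

    demonstrations: list of state sequences (the expert trajectories in D)
    features:       features[s] = phi(s), shape (n_states, d)
    visitation:     D_s, expected state visitation frequencies, shape (n_states,)
    """
    empirical = np.mean(
        [features[list(tau)].sum(axis=0) for tau in demonstrations], axis=0)
    expected = visitation @ features          # sum_s D_s * phi(s)
    return empirical - expected
```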
SLIDE 16
State Visitation Frequency
Observation:

π_ME(a_j | s_i, ω) ∝ Σ_{τ ∈ T : (s_i, a_j) = τ_{t=1}} p(τ | ω)

i.e. the policy is proportional to the probability mass of all trajectories starting with (s_i, a_j), where p(τ | ω) can be computed via the reward R.

Idea: Split the computation into two sub-problems.
- 1. Backward pass: Compute the policy π_ME(a | s, ω).
- 2. Forward pass: Compute the state visitation frequencies from π_ME(a | s, ω).
SLIDE 17
State Visitation Frequency: Backward Pass
Observation:

π_ME(a_j | s_i, ω) ∝ Σ_{τ ∈ T : (s_i, a_j) = τ_{t=1}} p(τ | ω)

Idea:

π_ME(a_j | s_i, ω) = Z_{s_i, a_j} / Z_{s_i}     (normalization)

where the partition values are obtained by recursively expanding the observation:

Z_{s_i, a_j} = Σ_{s_k∈S} p(s_k | s_i, a_j) · exp(ω⊤φ(s_i)) · Z_{s_k}
Z_{s_i} = Σ_{a_j∈A} Z_{s_i, a_j}

Algorithm (a code sketch follows below):
- 1. Initialize Z_{s_k} = 1 for all terminal states s_k ∈ S_terminal.
- 2. Compute Z_{s_i, a_j} and Z_{s_i} by recursively backing up from the terminal states.
- 3. Compute π_ME(a_j | s_i, ω) = Z_{s_i, a_j} / Z_{s_i}.

Parallels value iteration.
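A minimal sketch of the backward pass, assuming tabular dynamics p(s' | s, a) given as a NumPy array; iterating a fixed number of backup steps instead of an explicit terminal-state recursion is a simplification for illustration, not the exact formulation above:

```python
import numpy as np

def backward_pass(transition, features, omega, terminal, n_iterations=100):
    """Approximate pi_ME(a | s) = Z_{s,a} / Z_s by backing up partition values.

    transition: p(s' | s, a), shape (n_states, n_actions, n_states)
    features:   features[s] = phi(s), shape (n_states, d)
    terminal:   indices of terminal states
    """
    n_states = transition.shape[0]
    state_reward = np.exp(features @ omega)        # exp(omega^T phi(s)) per state

    z_state = np.zeros(n_states)
    z_state[terminal] = 1.0                        # Z_s = 1 for terminal states
    for _ in range(n_iterations):
        # Z_{s,a} = sum_{s'} p(s' | s, a) * exp(omega^T phi(s)) * Z_{s'}
        z_action = state_reward[:, None] * (transition @ z_state)
        z_state = z_action.sum(axis=1)             # Z_s = sum_a Z_{s,a}
        z_state[terminal] = 1.0                    # keep terminal states anchored

    return z_action / np.maximum(z_state, 1e-30)[:, None]   # pi_ME(a | s)
```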
SLIDE 18
State Visitation Frequency: Forward Pass
Idea: Propagate the starting-state probabilities p_0(s) forward via the policy π_ME(a | s, ω) (a code sketch follows after the algorithm).

Algorithm:
- 1. Initialize D_{s_i,0} = p_0(s_i), the probability that a trajectory starts in state s_i.
- 2. Recursively compute
     D_{s_k,t+1} = Σ_{s_i∈S} Σ_{a_j∈A} D_{s_i,t} · π_ME(a_j | s_i) · p(s_k | s_i, a_j)
- 3. Sum over time: D_{s_i} = Σ_{t=0,1,...} D_{s_i,t}
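A minimal sketch of the forward pass under the same tabular assumptions as the backward-pass sketch above (finite horizon, illustrative names):

```python
import numpy as np

def forward_pass(transition, policy, p_initial, n_iterations=100):
    """Compute expected state visitation frequencies D_s.

    transition: p(s' | s, a), shape (n_states, n_actions, n_states)
    policy:     pi_ME(a | s),  shape (n_states, n_actions)
    p_initial:  p_0(s),        shape (n_states,)
    """
    d_t = p_initial.copy()                       # D_{s,0}
    visitation = d_t.copy()
    for _ in range(n_iterations - 1):
        # D_{s',t+1} = sum_{s,a} D_{s,t} * pi(a | s) * p(s' | s, a)
        d_t = np.einsum('s,sa,sak->k', d_t, policy, transition)
        visitation += d_t
    return visitation                            # D_s = sum_t D_{s,t}
```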
SLIDE 19
Summary
Algorithm: Iterate until convergence (a sketch of the full loop follows below):
- 1. Compute the policy π_ME(a | s, ω) (backward pass).
- 2. Compute the state visitation frequencies D_{s_i} (forward pass).
- 3. Compute the gradient ∇L(ω) of the likelihood.
- 4. Take a gradient-based optimization step, e.g. ω ← ω + η∇L(ω).

Assumptions:
- Known transition dynamics T = p(s_{t+1} | s_t, a_t).
- Limited transition randomness.
- Linear reward R(s) = ω⊤φ(s).

Other Drawbacks:
- The MDP has to be "solved" once per iteration.
- Reward bias for stochastic transition dynamics.
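Putting the pieces together, a minimal training-loop sketch built from the illustrative helpers defined above (feature_count, likelihood_gradient, backward_pass, forward_pass); plain gradient ascent with a fixed step size is just one choice of optimizer, not prescribed by the method:

```python
import numpy as np

def maxent_irl(transition, features, demonstrations, terminal, p_initial,
               learning_rate=0.1, n_epochs=100):
    """Illustrative MaxEnt IRL loop; relies on the sketches defined earlier."""
    d = features.shape[1]
    omega = np.zeros(d)                                                    # reward parameters

    for _ in range(n_epochs):
        policy = backward_pass(transition, features, omega, terminal)     # 1. policy
        visitation = forward_pass(transition, policy, p_initial)          # 2. D_s
        grad = likelihood_gradient(demonstrations, features, visitation)  # 3. gradient
        omega += learning_rate * grad                                      # 4. ascent step

    return features @ omega                                                # learned reward R(s)
```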
SLIDE 20
Extensions
- Maximum Causal Entropy IRL [Ziebart 2010]
- Maximum Entropy Deep IRL [Wulfmeier et al. 2015]
- Maximum Entropy IRL in Continuous State Spaces with Path Integrals [Aghasadeghi and Bretl 2011]
SLIDE 21
Demonstration
github.com/qzed/irl-maxent
SLIDE 22
References
Abbeel, Pieter and Andrew Y. Ng (2004). “Apprenticeship Learning via Inverse Reinforcement Learning”. In: Proc. 21st Intl. Conference on Machine Learning (ICML ’04).
Aghasadeghi, N. and T. Bretl (Sept. 2011). “Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals”. In: Intl. Conference on Intelligent Robots and Systems (IROS 2011), pp. 1561–1566.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York Inc.
Jaynes, E. T. (May 1957). “Information Theory and Statistical Mechanics”. In: Physical Review 106.4, pp. 620–630.
Ng, Andrew Y., Daishi Harada, and Stuart J. Russell (1999). “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping”. In: Proc. 16th Intl. Conference on Machine Learning (ICML ’99), pp. 278–287.