SLIDE 1

A Comparative Analysis of Expected and Distributional Reinforcement Learning

Clare Lyle, Pablo Samuel Castro, Marc G. Bellemare
Presented by Jerrod Parker and Shakti Kumar

SLIDE 2

Outline:

1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations

SLIDE 5

Why Distributional RL?

1. Why restrict ourselves to the mean of the value distribution? I.e., approximate the expectation vs. approximate the distribution.
2. How well can we approximate multimodal returns?

SLIDE 6

Why Distributional RL?

SLIDE 7

Motivation

  • Poor theoretical understanding of the distributional RL framework
  • Benefits have only been seen in deep RL architectures, and it is not known whether simpler architectures have any advantage at all

SLIDE 13

Contributions

  • Is distributional RL different from expected RL?

○ Tabular setting
○ Tabular setting with a categorical distribution approximator
○ Linear function approximation
○ Nonlinear function approximation

  • Insights into nonlinear function approximators’ interaction with distributional RL

SLIDE 14

Outline:

1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations

SLIDE 15

General Background: Formulation

The return distribution satisfies the distributional Bellman equation Z(x, a) =_D R(x, a) + γ Z(X', A'), where X', A' are the random next state and action. Sources of randomness in Z(x, a):
1. Immediate rewards
2. Dynamics
3. Possibly stochastic policy

SLIDE 18

General Background: Visualization

(Figure: backup diagram; r denotes the scalar reward obtained for the transition.)

SLIDE 19

General Background: Randomness

Sources of randomness:

  • Immediate rewards
  • Stochastic dynamics
  • Possibly stochastic policy

SLIDE 20

General Background: Contractions?

1. Is the policy evaluation step a contraction? Can I trust that during policy evaluation my distribution is converging to the true return distribution?
2. Is contraction guaranteed in the control case, when I want to improve the current policy? Can I trust that the Bellman optimality operator will lead me to the optimal policy?

SLIDE 21

Policy Evaluation Contracts?

Is the policy evaluation step a contraction? Can I trust that during policy evaluation my distribution is converging to the true return distribution? Formally: given a policy π, do the iterates Z_{k+1} = T^π Z_k converge to Z^π?

SLIDE 22

Contraction in Policy Evaluation?

Given a policy π, do the iterates Z_{k+1} = T^π Z_k converge to Z^π?

The result says yes! T^π is a γ-contraction in the maximal Wasserstein metric, so you can rely on the distributional Bellman updates for policy evaluation.
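
In symbols (our rendering of the standard result from [3]), the policy evaluation operator is a γ-contraction in the maximal Wasserstein metric:

\[
\bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, \bar{d}_p(Z_1, Z_2),
\]

so by the Banach fixed-point theorem the iterates Z_{k+1} = T^π Z_k converge to the unique fixed point Z^π.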

SLIDE 23

Detour: Wasserstein Metric

The p-Wasserstein distance between distributions with CDFs F and G is defined as

    W_p(F, G) = ( ∫_0^1 |F^{-1}(u) − G^{-1}(u)|^p du )^{1/p},

where F^{-1} and G^{-1} are the inverse CDFs of F and G respectively. Its maximal form is

    d̄_p(Z_1, Z_2) = sup_{x,a} W_p(Z_1(x, a), Z_2(x, a)),

where Z_1, Z_2 range over Ƶ, the space of value distributions with bounded moments.
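
As a concrete aside (our sketch, not from the slides): for two empirical distributions with the same number of equally weighted atoms, the 1-Wasserstein distance reduces to the mean absolute difference of the sorted atoms, i.e., of the inverse CDFs on a uniform grid:

import numpy as np

def w1_empirical(a, b):
    # 1-Wasserstein distance between two equal-size, equally weighted
    # empirical distributions: match the sorted atoms (the inverse CDFs).
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

print(w1_empirical(np.array([-1.0, 1.0]), np.array([0.0, 2.0])))  # -> 1.0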

SLIDE 25

Contraction in Policy Evaluation?

Given a policy π, do the iterates Z_{k+1} = T^π Z_k converge to Z^π?

Thus, since T^π is a γ-contraction in d̄_p, the iterates converge to the unique fixed point Z^π.

SLIDE 26

Contraction in Control/Improvement?

Using Definitions 1 and 2 from the distributional-perspective paper [3], one can write a Bellman optimality operator for the control case, analogous to the policy evaluation operator. Is it a contraction? Unfortunately, this cannot be guaranteed...

SLIDE 28

Contraction in Policy Improvement?

SLIDE 29

Contraction in Policy Improvement?

Consider a transition x1 → x2. At x2 two actions are possible: r(a1) = 0, and r(a2) = ε+1 or ε−1, each with probability 0.5. Assume a1, a2 are terminal actions and the environment is undiscounted. What is the Bellman update TZ(x2, a2)? Since the actions are terminal, the backed-up distribution should equal the reward distribution; thus TZ(x2, a2) = ε±1 (i.e., two Diracs, at ε+1 and ε−1).

SLIDE 31

Contraction in Policy Improvement?

Recall that if rewards are scalar, Bellman updates are just the old distributions Z, scaled and translated. Thus the original distribution Z(x2, a2) can be considered a translated version of TZ(x2, a2). Let Z(x2, a2) be −ε±1. Then the 1-Wasserstein distance between Z and Z* (assuming Z and Z* agree everywhere except at (x2, a2)) is 2ε.

SLIDE 32

Contraction in Policy Improvement?

When we apply T to Z, the greedy action a1 is selected (E[Z(x2, a2)] = −ε < 0), so TZ(x1) = Z(x2, a1), whereas TZ*(x1) = Z*(x2, a2). For small ε the distance between TZ and TZ* therefore exceeds 2ε, showing that the undiscounted update is not a contraction. Thus a contraction cannot be guaranteed in the control case.
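
The argument can be checked numerically; a minimal sketch (ours), with ε = 0.1:

import numpy as np

def w1(a, b):
    # 1-Wasserstein between equal-size, equally weighted empirical distributions
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

eps = 0.1
Z_star = np.array([eps - 1, eps + 1])    # Z*(x2, a2): Diracs at eps-1, eps+1
Z      = np.array([-eps - 1, -eps + 1])  # Z(x2, a2): translated by -2*eps
print(w1(Z, Z_star))                     # 2*eps = 0.2

# Applying T: greedy picks a2 under Z* (mean +eps) but a1 under Z (mean -eps),
# so TZ*(x1) = Z*(x2, a2) while TZ(x1) = Z(x2, a1) = Dirac at 0.
TZ_star_x1 = Z_star
TZ_x1 = np.array([0.0, 0.0])             # Dirac at 0 written as two atoms
print(w1(TZ_x1, TZ_star_x1))             # 1.0 > 2*eps: no contraction

Since 1.0 > 2ε for any ε < 0.5, applying T increased the distance at x1.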

SLIDE 34

Contraction in Policy Improvement?

So is distributional RL a dead end? Bellemare showed that if there is a total ordering on the set of optimal policies and the state space is finite, then there exists an optimal distribution which is a fixed point of the Bellman optimality update in the control case, and policy improvement converges to this fixed point [4].

SLIDE 35

Contraction in Policy Improvement?

Here Z** is the set of value distributions corresponding to the set of optimal policies; it is a set of nonstationary optimal value distributions.

SLIDE 36

The C51 Algorithm

We could have minimized the Wasserstein metric between TZ and Z and derived an algorithm from that. But this cannot be learned from samples: the expected sample Wasserstein distance between two distributions is always at least the true Wasserstein distance between them, so sample gradients are biased. So how do you build an algorithm? Instead, project the updates TZ onto a finite support, which implicitly minimizes the Cramér distance to the original distribution, still approximating it while keeping the expectation the same. With that, we can state the entire algorithm.

SLIDE 37

The C51 Algorithm

This is the same as a Cramér projection, which we'll see in the next slide.

SLIDE 38

C51 Visually

Support: z1, z2, z3, …, zK, with a Dirac δ_zi at each atom. Update each Dirac as per the distributional Bellman operator, then distribute the mass of misaligned Diracs onto the supports.
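
A minimal NumPy sketch of this projection step (our rendering of the standard C51 projection from [3]; variable names are ours):

import numpy as np

def categorical_projection(atoms, probs, z):
    # Project a discrete distribution (atoms, probs) onto the fixed support z
    # by splitting each atom's mass between its two nearest support points,
    # proportionally to proximity (the Cramér projection of slide 37).
    dz = z[1] - z[0]                         # assumes equally spaced support
    out = np.zeros_like(z)
    for x, p in zip(np.clip(atoms, z[0], z[-1]), probs):
        b = (x - z[0]) / dz                  # fractional index of x in z
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                         # x sits exactly on a support point
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out

z = np.linspace(-1.0, 1.0, 5)                # support: -1, -0.5, 0, 0.5, 1
atoms = np.array([-0.3, 0.8])                # e.g. a Bellman target r + gamma*z'
probs = np.array([0.5, 0.5])
proj = categorical_projection(atoms, probs, z)
print(proj, proj @ z, atoms @ probs)         # both means are 0.25

Because each atom's mass is split so its support-weighted average lands back on the atom, the projection preserves the expectation whenever the atoms already lie inside [z1, zK] (Proposition 1 later in the deck); the clipping step is what breaks it otherwise.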

SLIDE 39

Cramér Distance

  • The gradient of the sample Wasserstein distance is biased (see Section 3 of reference [1])
  • For two probability distributions with CDFs F_P and F_Q, the Cramér metric is defined as l_2(P, Q) = ( ∫ (F_P(x) − F_Q(x))² dx )^{1/2}

SLIDE 40

Cramér Distance

An attractive metric for manipulating distributions:

1. The policy evaluation Bellman operator is a contraction in the Cramér distance as well, as shown by Rowland et al. 2018 [2].
2. The Cramér projection produces the distribution supported on z which minimizes the Cramér distance to the original distribution. If the original support is contained in the interval [z1, zK], it is straightforward to show that the Cramér projection preserves the distribution's expected value.

SLIDE 41

Cramér Distance

As we saw earlier, in distributional RL we need to approximate distributions. One way is to use a categorical distribution, as C51 did. The Cramér distance is then l_2(P, Q) = ( Σ_i c_i (F_P(z_i) − F_Q(z_i))² )^{1/2}, a weighted Euclidean norm between the CDFs of the two distributions. When the atoms of the support are equally spaced (c_i = c), this is a scalar multiple (√c) of the Euclidean distance between the vectors of the CDFs.
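
A small sketch (ours) of this categorical Cramér distance on an equally spaced support:

import numpy as np

def cramer_categorical(p, q, spacing=1.0):
    # Cramér distance between two categorical distributions on the same
    # equally spaced support: sqrt(spacing) * ||cumsum(p) - cumsum(q)||_2
    return np.sqrt(spacing * np.sum((np.cumsum(p) - np.cumsum(q)) ** 2))

p = np.array([0.25, 0.50, 0.25])
q = np.array([0.50, 0.00, 0.50])
print(cramer_categorical(p, q))   # CDFs (.25,.75,1) vs (.5,.5,1) -> ~0.354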

SLIDE 42

Outline:

1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations

SLIDE 43

Methods

  • Compare policy evaluation in expected RL vs. distributional RL in several settings (i.e., tabular, linear approximation, nonlinear approximation)
  • For each setting, the goal is to show expectation equivalence of the expected version vs. an analogous distributional version
  • Expectation equivalence: want to show E[Z_t(s, a)] = Q_t(s, a) for all t
  • Use the same experience in both

SLIDE 44

Methods: Sequence of Proofs

1. Tabular Models: represent the distribution over returns at each (s, a) separately
   a. Model-Based (full knowledge of the transition model and policy):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
   b. Sample-Based (SARSA-style updates, i.e., only using samples):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
      iii. Semi-gradient w.r.t. the CDF for the distributional update, compared to SARSA
      iv. Semi-gradient w.r.t. the PDF for the distributional update, compared to SARSA (doesn't hold)
2. Linear Approximation:
   a. Semi-gradient of the Cramér distance w.r.t. the CDF
3. Nonlinear Approximation:
   a. There exists a nonlinear representation of the CDF such that we initially have equivalence but lose it after the first weight update.

SLIDE 46

Proposition 1: Cramér Projection

  • If a distribution's finite support lies inside [z1, zK], where z1 < … < zK is the fixed support z, then Cramér-projecting it onto z leaves its expectation unchanged.

SLIDE 47

Proposition 2: Tabular, Model-Based

Z(s,a) and Q(s,a) defined separately for each (s,a)
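
A toy numeric illustration of Proposition 2 (our sketch; the two-state MDP and all names are invented for illustration): a full model-based distributional backup has the same expectation trajectory as the Q-value backup.

import numpy as np

gamma = 0.9
# s=0, a=0: list of (transition prob, reward, next state); s=1 is absorbing.
P = {0: {0: [(0.5, 1.0, 1), (0.5, -1.0, 1)]},
     1: {0: [(1.0, 0.0, 1)]}}

def dist_backup(Z):
    # One model-based distributional Bellman backup.
    # Z maps (s, a) -> dict {return atom: probability}.
    Znew = {}
    for s, actions in P.items():
        for a, outcomes in actions.items():
            d = {}
            for p_trans, r, s2 in outcomes:
                for atom, pz in Z[(s2, 0)].items():   # single action: policy is trivial
                    new_atom = r + gamma * atom
                    d[new_atom] = d.get(new_atom, 0.0) + p_trans * pz
            Znew[(s, a)] = d
    return Znew

def q_backup(Q):
    return {(s, a): sum(p * (r + gamma * Q[(s2, 0)]) for p, r, s2 in outcomes)
            for s, actions in P.items() for a, outcomes in actions.items()}

Z = {(s, 0): {0.0: 1.0} for s in P}   # initialize as a Dirac at 0
Q = {(s, 0): 0.0 for s in P}
for _ in range(5):
    Z, Q = dist_backup(Z), q_backup(Q)
    for sa in Q:   # expectation equivalence at every iteration (Prop. 2)
        assert abs(sum(z * p for z, p in Z[sa].items()) - Q[sa]) < 1e-10
print("E[Z] matches Q at every iteration")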

SLIDE 49

Proof of Proposition 2

SLIDE 50

Tabular, Model-Based, Categorical Distributions

Suppose Z has finite support {z1, …, zK}. Applying the distributional Bellman operator can move mass off this support, so the resulting distribution must be projected back onto it. Proposition 3: the Cramér-projected update still has expectation equivalence with the Q-update.

SLIDE 52

SARSA vs Distributional SARSA (Arbitrary Distribution)

Given a transition (x_t, a_t, r_t, x_{t+1}, a_{t+1}):

    SARSA: Q(x_t, a_t) ← (1 − α) Q(x_t, a_t) + α (r_t + γ Q(x_{t+1}, a_{t+1}))
    Distributional SARSA: Z(x_t, a_t) ← (1 − α) Z(x_t, a_t) + α (f_{r_t,γ})# Z(x_{t+1}, a_{t+1}),

where (f_{r,γ})# Z denotes the law of r + γZ (a mixture update on distributions).

Proposition 4: These two policy evaluation methods have expectation equivalence.
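
A quick numeric check of Proposition 4 (our sketch: a single state-action pair updated repeatedly from a fixed successor distribution):

import numpy as np
rng = np.random.default_rng(0)

gamma, alpha = 0.9, 0.5
atoms, probs = np.array([0.0]), np.array([1.0])   # Z(x, a): Dirac at 0
Q = 0.0
z_next_atoms = np.array([-1.0, 1.0])              # Z(x', a'): fixed successor
z_next_probs = np.array([0.5, 0.5])
Q_next = float(z_next_atoms @ z_next_probs)

for _ in range(10):
    r = rng.normal()
    # Distributional SARSA: mixture of old Z and the pushforward r + gamma*Z'
    atoms = np.concatenate([atoms, r + gamma * z_next_atoms])
    probs = np.concatenate([(1 - alpha) * probs, alpha * z_next_probs])
    # SARSA update on the mean, using the same sampled experience
    Q = (1 - alpha) * Q + alpha * (r + gamma * Q_next)
    assert np.isclose(atoms @ probs, Q)   # expectation equivalence (Prop. 4)
print("E[Z] tracks Q after every update:", atoms @ probs)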

SLIDE 53

Proof: SARSA vs Distributional SARSA

Expand the law of Z_{t+1} and notice the similarity between the expected SARSA and distributional SARSA updates.

SLIDE 55

SARSA vs Distributional SARSA (with Categorical Distributions)

Recall the distributional SARSA update. The difference here: the mixture target must be projected onto the fixed support.

SLIDE 56

Proof: SARSA vs Distributional SARSA (Categorical)

We need to Cramér-project the bootstrapped variable r_t + γ Z(x_{t+1}, a_{t+1}) onto the support; by Proposition 1 this preserves its expectation, so the argument from Proposition 4 goes through.

SLIDE 58

SARSA vs Semi-Gradient of Cramér Distance

Assume the approximating distribution is categorical on a c-spaced support, and take the gradient of the squared Cramér distance w.r.t. the CDF.

Goal (Proposition 6): show there is a semi-gradient update which maintains expectation equivalence with SARSA (with a slight change in step size).

Results: semi-gradient w.r.t. the CDF gives expectation equivalence; semi-gradient w.r.t. the PDF does not.
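
Why the CDF semi-gradient preserves the expectation can be sketched directly (our derivation, assuming all mass lies on the c-spaced support z_1 < … < z_K). Writing the mean through the CDF,

\[
\mathbb{E}[Z] = z_K - c \sum_{i=1}^{K-1} F(z_i),
\]

a pointwise step F(z_i) ← F(z_i) + α(F̂(z_i) − F(z_i)) (the Cramér semi-gradient, up to a constant factor absorbed into α) changes the mean by

\[
\Delta\mathbb{E}[Z] = -c\,\alpha \sum_{i=1}^{K-1} \bigl(\hat{F}(z_i) - F(z_i)\bigr) = \alpha\bigl(\mathbb{E}[\hat{Z}] - \mathbb{E}[Z]\bigr),
\]

which is exactly a SARSA-style update on the mean, matching the "slight change in step size".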

SLIDE 60

Linear Function Approximation

Loss function: the squared Cramér distance between the predicted and target distributions, with the CDF parameterized linearly in the features. Update rule: the semi-gradient of this loss w.r.t. the CDF.

SLIDE 61

Linear Function Approximation

Applying the θ update from the last slide. Takeaway: if (1) the distributions sum to 1 and (2) the distance between bins in the distribution is 1, then expectation equivalence holds.
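
A numeric check of this takeaway (our construction, not the paper's exact parameterization): a CDF that is linear in the features, updated by the Cramér semi-gradient, moves its implied mean exactly like linear SARSA with step size 2cα:

import numpy as np
rng = np.random.default_rng(1)

z, c = np.array([0.0, 1.0, 2.0]), 1.0       # support with unit spacing
phi = rng.normal(size=3)                    # features of the state being updated
W = rng.uniform(0.1, 0.4, size=(2, 3))      # rows parameterize F(z1), F(z2); F(z3) = 1
alpha = 0.05

def mean_from_cdf(F):
    # E[Z] = z_K - c * sum_{i<K} F(z_i); we don't enforce valid CDF values,
    # the algebraic equivalence below does not need them.
    return z[-1] - c * F.sum()

F = W @ phi
Q = mean_from_cdf(F)                        # expected-RL value, agreeing initially
F_hat = np.array([0.3, 0.8])                # some target CDF

W += 2 * c * alpha * (F_hat - F)[:, None] * phi[None, :]       # Cramér semi-gradient step
Q += 2 * c * alpha * (mean_from_cdf(F_hat) - Q) * (phi @ phi)  # linear SARSA-style step

assert np.isclose(mean_from_cdf(W @ phi), Q)
print("linear-case expectation equivalence holds:", Q)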

SLIDE 63

Nonlinear Function Approximation

The authors construct an example showing that expectation equivalence doesn't always hold:

1. Start with expectation equivalence.
2. Fix a transition such that the target and the prediction have the same expectation but different distributions.

SLIDE 65

Nonlinear Function Approximation

The authors construct an example showing that expectation equivalence doesn't always hold:

1. Start with expectation equivalence.
2. Fix a transition such that the target and the prediction have the same expectation but different distributions.
3. Now take a gradient step (using the gradient of the Cramér distance): the expectation of the predicted distribution changes while the Q-value does not (its TD error is zero), so expectation equivalence is broken.
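
An analogous toy illustration (ours, not the paper's construction): parameterize a categorical distribution nonlinearly via a softmax, pick a target with the same mean but a different shape, and observe that one Cramér gradient step changes the predicted mean:

import numpy as np

z = np.array([0.0, 1.0, 2.0, 3.0])          # support, spacing c = 1
p = np.array([0.35, 0.10, 0.25, 0.30])      # prediction, mean 1.5
q = np.array([0.50, 0.00, 0.00, 0.50])      # target, mean 1.5 (same mean!)
w = np.log(p)                               # softmax weights reproducing p

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# Gradient of the squared Cramér distance sum_i (P_i - Q_i)^2 w.r.t. w
P, Q = np.cumsum(p), np.cumsum(q)
dL_dP = 2 * (P - Q)                         # last entry is 0 (both CDFs end at 1)
dL_dp = np.cumsum(dL_dP[::-1])[::-1]        # dP_i/dp_k = 1 for k <= i
dL_dw = p * (dL_dp - p @ dL_dp)             # chain rule through the softmax

p_new = softmax(w - 0.1 * dL_dw)            # one gradient step
print("mean before:", z @ p, " mean after:", z @ p_new)  # 1.5 -> ~1.4986

The TD error on the mean is zero (both means are 1.5), so the expected-RL update would leave Q unchanged, while the distributional update moved E[Z]: expectation equivalence breaks.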

SLIDE 66

Nonlinear Function Approximation

Takeaways:

  • This doesn't prove that this happens for all nonlinear functions.
  • The gradient is taken w.r.t. the Cramér distance, which isn't the case in many successful algorithms (quantile distributional RL, for example, minimizes the Wasserstein distance).
  • Expectation equivalence never breaks in the linear case, which might mean that the benefits of distributional RL seen in practice have to do with its interplay with nonlinear function approximation.

SLIDE 67

Recap: Sequence of Proofs

1. Tabular Models: represent the distribution over returns at each (s, a) separately
   a. Model-Based (full knowledge of the transition model and policy):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
   b. Sample-Based (SARSA-style updates, i.e., only using samples):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
      iii. Semi-gradient w.r.t. the CDF for the distributional update, compared to SARSA
      iv. Semi-gradient w.r.t. the PDF for the distributional update, compared to SARSA (doesn't hold)
2. Linear Approximation:
   a. Semi-gradient of the Cramér distance w.r.t. the CDF
3. Nonlinear Approximation:
   a. There exists a nonlinear representation of the CDF such that we initially have equivalence but lose it after the first weight update.

SLIDE 68

Takeaways

1. In the cases where they proved expectation equivalence, there is nothing to gain from distributional RL in terms of expected return. For example:
   a. The variance of our expected-return estimate is the same in expected and distributional RL, since Var[E(Z)] = Var[Q].
   b. If using greedy methods, policy improvement steps will be equivalent, since the expected value is the same for each action.
2. Distributional RL and expected RL are usually expectation-equivalent for tabular representations and linear function approximation.
3. Expectation equivalence doesn't always hold when using nonlinear function approximation.

SLIDE 69

Experimental Results: Tabular Case (12x12 Grid)

Compare Q-learning, distributional with CDF updates, and distributional with PDF updates, using the same random seed and ε-greedy actions (so all methods end with the same results if expectation equivalence holds).

SLIDE 70

Outline:

1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations

SLIDE 71

Experimental Results: Linear Approximation

(Figures: results on Cart Pole and Acrobot.)

SLIDE 72

Experimental Results: Nonlinear Approximation

(Figures: results on Cart Pole and Acrobot.)

SLIDE 73

Outline:

1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations

SLIDE 74

Limitations

  • Their results all hold for minimizing the Cramér distance, but possibly not for other metrics used in some successful distributional RL algorithms (Wasserstein, cross-entropy)
  • The algorithm they use throughout their proofs doesn't seem to lead to quality results in practice
  • Even though the Cramér distance provably improves on the Wasserstein distance's limitations for distributional RL [1], the empirical results don't convey this

SLIDE 75

Open Questions

1. What is it about deep neural networks that benefits most from the distributional perspective?
2. Is there a regularizing effect of modeling a distribution instead of the expected value?

SLIDE 76

Questions

1. Derive:
2. What is one of the major benefits of the Cramér projection?
3. What are some possible reasons for the performance improvement of distributional RL over expected-value RL when using nonlinear function approximation?

SLIDE 77

References

[1] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

[2] Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, pages 29-37. PMLR, 2018.

[3] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In ICML, 2017.

[4] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.