
Adaptations of the Thompson Sampling Algorithm for Multi-Armed Bandits

Ciara Pike-Burke
Supervisor: David Leslie
24th April 2015

1 / 14


Introduction

Motivation

In many real-life problems there is a trade-off to be made between exploitation and exploration, for example:

◮ In clinical trials.
◮ In portfolio optimization.
◮ In website optimization.
◮ In choosing a restaurant.

2 / 14


Introduction

Multi-Armed Bandits

One of the best ways to model the exploitation vs. exploration trade-off is with Multi-Armed Bandits.

◮ Each of k slot machines has an unknown reward distribution.
◮ We want to maximize reward, or equivalently minimize regret.
◮ Regret is the accumulated difference in expected reward between the arm we played and the optimal arm.
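The regret definition above can be written out directly. A minimal sketch, assuming the simulator knows the true arm means (the algorithm itself never sees them); the names `probs` and `arms_played` are illustrative, not from the talk:

```python
def regret(probs, arms_played):
    """Cumulative regret of a sequence of plays: for each play, the
    gap between the optimal arm's expected reward and the expected
    reward of the arm actually played."""
    best = max(probs)  # expected reward of the optimal arm
    return sum(best - probs[arm] for arm in arms_played)
```

For example, with two arms of means (0.35, 0.8), playing the worse arm once and the better arm twice accumulates regret 0.45.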

Figure: A Multi-Armed Bandit

(image from research.microsoft.com)

3 / 14


Algorithms: Thompson Sampling

Thompson Sampling

For the case of Bernoulli rewards, the Thompson Sampling algorithm is:

1. Initialize with Uniform (Beta(1, 1)) priors on the reward of each arm.
2. At each time step t:
   ◮ Sample θi from Beta(si(t − 1) + 1, fi(t − 1) + 1) for each arm i.
   ◮ Play the arm that corresponds to the largest θi.
   ◮ Update si(t) and fi(t) for all i.

Here si(t) is the number of successes from playing arm i in the first t plays of the algorithm, and fi(t) is the number of failures.
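The steps above can be sketched in Python. This is a hedged sketch, not the talk's own code; numpy's Beta sampler stands in for the posterior draws, and the function name is illustrative:

```python
import numpy as np

def thompson_sampling(probs, horizon, seed=None):
    """Thompson Sampling for the Bernoulli bandit; `probs` holds the
    true, unknown success probability of each arm."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs)
    k = len(probs)
    s = np.zeros(k)  # s_i(t): successes on arm i so far
    f = np.zeros(k)  # f_i(t): failures on arm i so far
    total_reward = 0
    for _ in range(horizon):
        theta = rng.beta(s + 1, f + 1)           # one posterior sample per arm
        arm = int(np.argmax(theta))              # play the largest theta_i
        reward = int(rng.random() < probs[arm])  # Bernoulli reward
        s[arm] += reward
        f[arm] += 1 - reward
        total_reward += reward
    return s, f, total_reward
```

Over a long enough horizon the posterior of the best arm concentrates, so it is sampled (and played) increasingly often.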

4 / 14


Algorithms: Thompson Sampling

Sampling Distributions

Figure: Thompson Sampling for the 2-armed Bernoulli bandit with p = (0.35, 0.8).

5 / 14



Algorithms: Optimistic Bayesian Sampling

OBS: Motivation

◮ If the variance of the better arm is too large, Thompson Sampling will often end up playing the inferior arm.
◮ May et al. (2012) propose a new method, Optimistic Bayesian Sampling, to combat this.

6 / 14


Algorithms: Optimistic Bayesian Sampling

OBS: Outline

◮ Optimistic Bayesian Sampling is the same as Thompson Sampling except for the decision rule.
◮ At each time step t, play the arm that maximizes qi = max{θi, µi}, where θi ∼ Beta(si(t − 1) + 1, fi(t − 1) + 1) and µi is the mean of this distribution.
◮ Optimistic Bayesian Sampling has been shown, empirically and theoretically, to perform better than Thompson Sampling.
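The OBS decision rule is a drop-in replacement for the Thompson Sampling choice. A sketch, with illustrative names; `s` and `f` are the per-arm success and failure counts:

```python
import numpy as np

def obs_choose_arm(s, f, rng):
    """One OBS decision: play argmax_i q_i with q_i = max(theta_i, mu_i),
    where theta_i is a Beta(s_i + 1, f_i + 1) posterior sample and
    mu_i = (s_i + 1) / (s_i + f_i + 2) is that posterior's mean."""
    theta = rng.beta(s + 1, f + 1)  # Thompson sample per arm
    mu = (s + 1) / (s + f + 2)      # posterior mean per arm
    return int(np.argmax(np.maximum(theta, mu)))
```

Taking the max with the posterior mean makes the index "optimistic": an arm's score never drops below its current estimated value, which prevents one unlucky sample from burying the better arm.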

7 / 14


Algorithms: Optimistic Bayesian Sampling using Rejection Sampling

Motivation

Figure: histograms of the sampled probabilities (prob vs. Frequency) under Thompson Sampling and under Optimistic Bayesian Sampling.

8 / 14



Algorithms: Optimistic Bayesian Sampling using Rejection Sampling

Optimistic Bayesian Sampling using Rejection Sampling

We can use Rejection Sampling to obtain samples from the truncated Beta distribution.

Figure: histogram of the probabilities sampled by Rejection Sampling (prob vs. Frequency).

◮ The algorithm is the same as for Thompson Sampling, but sampling from the truncated Beta(si(t − 1) + 1, fi(t − 1) + 1) distribution.
◮ We can choose any proposal distribution; the simplest is the Beta distribution itself.
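With the plain Beta as proposal, the truncated draw is simple rejection sampling: propose from the untruncated Beta and keep the first sample above the truncation point. A sketch; taking the truncation point to be the posterior mean (so every accepted sample is "optimistic") is my reading of the OBS construction, not stated explicitly on the slide:

```python
import numpy as np

def truncated_beta_sample(a, b, lower, rng, max_tries=100000):
    """Rejection sampling from Beta(a, b) truncated to [lower, 1]."""
    for _ in range(max_tries):
        x = rng.beta(a, b)
        if x >= lower:   # accept any proposal above the truncation point
            return x
    raise RuntimeError("acceptance region has very low probability")

# For arm i one would use a = s_i + 1, b = f_i + 1 and, under the
# assumption above, lower = (s_i + 1) / (s_i + f_i + 2).
```

The acceptance rate equals the Beta mass above `lower`; truncating at the mean keeps it moderate, but it can become small for skewed posteriors, which is one reason this variant can be slow.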

9 / 14



Simulation Study

The three methods, Thompson Sampling, Optimistic Bayesian Sampling, and Optimistic Bayesian Sampling using Rejection Sampling, were tested on four simulations with Bernoulli rewards.

Simulation 1: The 2-armed bandit with randomly generated probabilities p = (0.34, 0.92).
Simulation 2: The 5-armed bandit with p = (0.45, 0.45, 0.45, 0.55, 0.45).
Simulation 3: The 10-armed bandit with p = (0.9, 0.8, . . . , 0.8).
Simulation 4: The 20-armed bandit with randomly generated probabilities p = (0.56, 0.09, 0.68, 0.69, 0.19, 0.45, 0.77, 0.29, 0.58, 0.11, 0.91, 0.17, 0.29, 0.95, 0.90, 0.39, 0.38, 0.53, 0.84, 0.03).
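For reference, the four setups can be encoded directly. The dict name is illustrative; Simulation 3's vector is expanded from the slide's ellipsis as one arm at 0.9 and nine at 0.8, which the stated arm count makes unambiguous:

```python
# The four Bernoulli-bandit setups from the simulation study;
# each entry is the vector p of per-arm success probabilities.
simulations = {
    1: [0.34, 0.92],
    2: [0.45, 0.45, 0.45, 0.55, 0.45],
    3: [0.9] + [0.8] * 9,  # 10 arms: the slide's (0.9, 0.8, ..., 0.8)
    4: [0.56, 0.09, 0.68, 0.69, 0.19, 0.45, 0.77, 0.29, 0.58, 0.11,
        0.91, 0.17, 0.29, 0.95, 0.90, 0.39, 0.38, 0.53, 0.84, 0.03],
}
```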

10 / 14


Simulation Study

Results

11 / 14



Conclusion

◮ Both adaptations of the Thompson Sampling algorithm seem to perform better than the original in simulations.
◮ However, Optimistic Bayesian Sampling using Rejection Sampling can be slow.
◮ The theoretical regret bound of OBS is better than that of Thompson Sampling; no regret bound has yet been proved for OBS using Rejection Sampling.

12 / 14


Conclusion

Future Work

◮ More careful consideration of the proposal distribution for Optimistic Bayesian Sampling using Rejection Sampling.
◮ Theoretical results for OBS using Rejection Sampling.
◮ Further simulations with:
  ◮ more arms,
  ◮ more complex reward distributions,
  ◮ contextual bandits,
  ◮ addition or subtraction of arms mid-way through the algorithm.

13 / 14



References

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294.

Agrawal, S. and Goyal, N. (2011). Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797.

May, B. C., Korda, N., Lee, A. and Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069-2106.

Thank you for listening. Any questions?

14 / 14