Adaptations of the Thompson Sampling Algorithm for Multi-Armed Bandits
Ciara Pike-Burke Supervisor: David Leslie 24th April 2015
Introduction
Multi-armed bandit problems arise in many settings:
◮ In clinical trials.
◮ In portfolio optimization.
◮ In website optimization.
◮ Choosing a restaurant.
◮ Each of k slot machines has an unknown reward distribution.
◮ Want to maximize reward, or equivalently minimize regret.
◮ Regret is the accumulated difference in expected reward of the optimal arm and the arm actually played.
Figure: A Multi-Armed Bandit (image from research.microsoft.com)
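The regret definition above can be made concrete with a short sketch (the arm means match the deck's later two-armed example; the play sequence is made up for illustration):

```python
def regret(true_means, plays):
    """Accumulated regret: at each step, the gap between the best
    arm's expected reward and that of the arm actually played."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in plays)

# Two arms with means (0.35, 0.8): playing arms 0, 1, 1, 0 incurs
# regret 0.45 + 0 + 0 + 0.45 = 0.9.
total_regret = regret([0.35, 0.8], [0, 1, 1, 0])
```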
Algorithms: Thompson Sampling
◮ Sample θi from Beta(si(t − 1) + 1, fi(t − 1) + 1) for each arm i, where si and fi count the successes and failures of arm i.
◮ Play the arm that corresponds to the largest θi.
◮ Update si(t) and fi(t) for all i.
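The three steps above can be sketched in Python for the Bernoulli bandit (the horizon and seed are illustrative; the arm means are the deck's p = (0.35, 0.8) example):

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Thompson Sampling for a Bernoulli bandit with Beta(1, 1) priors.

    s[i] and f[i] count successes and failures of arm i, so arm i's
    posterior is Beta(s[i] + 1, f[i] + 1).
    """
    rng = random.Random(seed)
    k = len(true_probs)
    s = [0] * k  # successes per arm
    f = [0] * k  # failures per arm
    total_reward = 0
    for _ in range(horizon):
        # Sample theta_i from each arm's Beta posterior.
        theta = [rng.betavariate(s[i] + 1, f[i] + 1) for i in range(k)]
        # Play the arm with the largest sampled value.
        arm = max(range(k), key=lambda i: theta[i])
        # Observe a Bernoulli reward and update that arm's counts.
        reward = 1 if rng.random() < true_probs[arm] else 0
        total_reward += reward
        if reward:
            s[arm] += 1
        else:
            f[arm] += 1
    return total_reward, s, f

total, s, f = thompson_sampling([0.35, 0.8], horizon=1000)
```

After enough rounds the posterior of the better arm concentrates, so the algorithm plays it on the vast majority of steps.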
Figure: Thompson Sampling for the 2-armed Bernoulli bandit with p = (0.35, 0.8).
Algorithms: Optimistic Bayesian Sampling
◮ If the variance of the better arm is too large, Thompson Sampling can sample a low value for it and continue to explore inferior arms.
◮ May et al. (2012) propose a new method, Optimistic Bayesian Sampling (OBS).
◮ Optimistic Bayesian Sampling is the same as Thompson Sampling, except that each arm's sampled value is replaced by the maximum of the sample and the posterior mean.
◮ At each time step t, play the arm that maximizes max(θi, Qi(t)), where θi is the posterior sample and Qi(t) is the posterior mean of arm i.
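The OBS rule of May et al. (2012) is a drop-in change to the sampling step. A minimal sketch for the Bernoulli setting above (the success/failure counts in the usage example are made-up illustrative values):

```python
import random

def obs_values(s, f, rng):
    """One round of Optimistic Bayesian Sampling action values.

    Each arm's value is the maximum of a posterior sample and the
    posterior mean, so an arm is never valued below its current mean
    estimate (May et al., 2012).
    """
    values = []
    for si, fi in zip(s, f):
        sample = rng.betavariate(si + 1, fi + 1)
        mean = (si + 1) / (si + fi + 2)  # mean of Beta(si+1, fi+1)
        values.append(max(sample, mean))
    return values

# Illustrative counts: arm 0 has 3 successes / 5 failures,
# arm 1 has 8 successes / 2 failures.
rng = random.Random(1)
vals = obs_values([3, 8], [5, 2], rng)
arm = max(range(2), key=lambda i: vals[i])
```

Because the value is floored at the posterior mean, a single unlucky low sample for the better arm cannot push its value below its mean estimate.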
Algorithms: Optimistic Bayesian Sampling using Rejection Sampling
Figure: Histograms of sampled action values (prob vs. Frequency) under Thompson Sampling and under Optimistic Bayesian Sampling.
Figure: Histogram of sampled action values (prob vs. Frequency) under Rejection Sampling.
◮ The algorithm is the same as for Thompson Sampling, but each arm's value is drawn by rejection sampling.
◮ Can choose any proposal distribution; the simplest is the Beta posterior itself.
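A sketch of one possible rejection sampler, under the assumption that the optimistic value is drawn from the Beta posterior conditioned to lie above its mean, with the posterior itself as the proposal (the counts in the usage example are illustrative):

```python
import random

def rejection_sample_above_mean(si, fi, rng, max_tries=10_000):
    """Sample from the Beta(si+1, fi+1) posterior conditioned on
    exceeding its mean: propose from the posterior itself and reject
    draws that fall below the mean."""
    mean = (si + 1) / (si + fi + 2)
    for _ in range(max_tries):
        theta = rng.betavariate(si + 1, fi + 1)
        if theta >= mean:
            return theta
    # The acceptance probability is roughly 1/2 for a Beta posterior,
    # so this fallback is essentially never reached.
    return mean

# Illustrative counts: 3 successes, 5 failures.
rng = random.Random(2)
theta = rejection_sample_above_mean(3, 5, rng)
```

Since roughly half of the proposals are accepted, each draw costs about two Beta samples on average, which is the computational price of the optimistic distribution.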
Simulation Study
Conclusion
◮ Both adaptations of the Thompson Sampling algorithm seem to perform well in simulation.
◮ However, Optimistic Bayesian Sampling using Rejection Sampling is more computationally expensive.
◮ The theoretical regret bound of OBS is better than that of Thompson Sampling.
Future work:
◮ More careful consideration of the proposal distribution for Rejection Sampling.
◮ Theoretical results for OBS using Rejection Sampling.
◮ Further simulations with:
  ◮ more arms,
  ◮ more complex reward distributions,
  ◮ contextual bandits,
  ◮ addition or subtraction of arms mid-way through the algorithm.
References
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294.
Agrawal, S. and Goyal, N. (2011). Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797.
May, B. C., Korda, N., Lee, A. and Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069-2106.