ON THE IDENTIFICATION AND MITIGATION OF WEAKNESSES IN THE KNOWLEDGE GRADIENT POLICY FOR MULTI-ARMED BANDITS

Published online by Cambridge University Press:  13 September 2016

James Edwards
Affiliation:
STOR-i Centre for Doctoral Training, Lancaster University, Lancaster LA1 4YF, UK E-mail: j.edwards4@lancaster.ac.uk
Paul Fearnhead
Affiliation:
Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, UK E-mail: p.fearnhead@lancaster.ac.uk
Kevin Glazebrook
Affiliation:
Department of Management Science, Lancaster University, Lancaster LA1 4YX, UK E-mail: k.glazebrook@lancaster.ac.uk

Abstract

The knowledge gradient (KG) policy was originally proposed for offline ranking and selection problems but has recently been adapted for use in online decision-making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions that are dominated with respect to both exploitation and exploration. We propose variants of KG that avoid such errors. These new policies include an index heuristic, which deploys a KG approach to develop an approximation to the Gittins index. A numerical study shows this policy to perform well over a range of MABs, including those for which index policies are not optimal. While KG does not take dominated actions when bandits are Gaussian, it fails to be index consistent and, when arms are correlated, appears not to enjoy a performance advantage over competitor policies that would compensate for its greater computational demands.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2016 
