
A confirmation of a conjecture on Feldman’s two-armed bandit problem

Published online by Cambridge University Press:  26 May 2023

Zengjing Chen (Shandong University)
Yiwei Lin (Shandong University)
Jichen Zhang (Shandong University)

*Postal address: School of Mathematics, Shandong University, Jinan 250100, China.

Abstract

The myopic strategy is one of the most important strategies in the study of bandit problems. Nouiehed and Ross (J. Appl. Prob. 55 (2018), 318–324) put forward a conjecture about Feldman’s bandit problem: for Bernoulli two-armed bandit problems, the myopic strategy stochastically maximizes the number of wins. In this paper we consider the two-armed bandit problem with more general distributions and utility functions. We confirm this conjecture by proving a stronger result: when the agent playing the bandit has a general utility function, the myopic strategy remains optimal if and only if that utility function satisfies reasonable conditions.
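The Bernoulli setting referred to in the abstract can be illustrated with a short simulation. The sketch below is only an illustration, not the paper’s model: it assumes independent uniform Beta(1, 1) priors on the two arms (whereas Feldman’s original problem has dependent arms with a known pair of distributions), and at each step the myopic, i.e. greedy-Bayesian, strategy pulls the arm with the larger posterior mean success probability.

```python
import random

def myopic_bernoulli_bandit(p, n_rounds, seed=0):
    """Simulate the myopic strategy on a Bernoulli two-armed bandit.

    p        -- pair of true success probabilities (p0, p1)
    n_rounds -- number of pulls
    Each arm gets an independent Beta(1, 1) prior; at every step the
    arm with the larger posterior mean is pulled (ties go to arm 0)
    and its posterior is updated.  Returns the total number of wins.
    """
    rng = random.Random(seed)
    alpha = [1, 1]  # Beta posterior parameters: successes + 1
    beta = [1, 1]   # Beta posterior parameters: failures + 1
    wins = 0
    for _ in range(n_rounds):
        means = [alpha[i] / (alpha[i] + beta[i]) for i in range(2)]
        arm = 0 if means[0] >= means[1] else 1
        reward = 1 if rng.random() < p[arm] else 0
        wins += reward
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return wins
```

The conjecture the paper confirms says that, in the Bernoulli case, no other strategy stochastically dominates this greedy rule in the number of wins.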

Type: Original Article
Copyright: © The Author(s), 2023. Published by Cambridge University Press on behalf of Applied Probability Trust


References

Abernethy, J., Hazan, E. and Rakhlin, A. (2008). Competing in the dark: an efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory (COLT 2008), pp. 263–274.
Agrawal, R. (1995). Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem. Adv. Appl. Prob. 27, 1054–1078.
Ahmad, S. H. A., Liu, M., Javidi, T., Zhao, Q. and Krishnamachari, B. (2009). Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inform. Theory 55, 4040–4050.
Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11, 2785–2836.
Audibert, J.-Y., Munos, R. and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410, 1876–1902.
Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R. (1995). Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322–331. IEEE Computer Society Press.
Banks, J. S. and Sundaram, R. K. (1992). A class of bandit problems yielding myopic optimal strategies. J. Appl. Prob. 29, 625–632.
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Springer, Netherlands.
Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060–1074.
Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41, 1516–1541.
Chen, Z., Epstein, L. G. and Zhang, G. (2021). A central limit theorem, loss aversion and multi-armed bandits. Available at arXiv:2106.05472.
Deo, S., Iravani, S., Jiang, T., Smilowitz, K. and Samuelson, S. (2013). Improving health outcomes through better capacity allocation in a community-based chronic care model. Operat. Res. 61, 1277–1294.
Feldman, D. (1962). Contributions to the ‘two-armed bandit’ problem. Ann. Math. Statist. 33, 847–856.
Garbe, R. and Glazebrook, K. D. (1998). Stochastic scheduling with priority classes. Math. Operat. Res. 23, 119–144.
Gast, N., Gaujal, B. and Khun, K. (2022). Testing indexability and computing Whittle and Gittins index in subcubic time. Available at arXiv:2203.05207.
Gittins, J. and Jones, D. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics, ed. J. Gani, pp. 241–266. North-Holland, Amsterdam.
Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. Ann. Statist. 20, 1625–1636.
Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices. John Wiley.
Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In 23rd Conference on Learning Theory (COLT 2010), pp. 67–79.
Kaufmann, E. (2018). On Bayesian index policies for sequential resource allocation. Ann. Statist. 46, 842–865.
Kelley, T. A. (1974). A note on the Bernoulli two-armed bandit problem. Ann. Statist. 2, 1056–1062.
Kim, M. J. and Lim, A. E. (2015). Robust multiarmed bandit problems. Manag. Sci. 62, 264–285.
Kuleshov, V. and Precup, D. (2014). Algorithms for multi-armed bandit problems. Available at arXiv:1402.6028.
Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22.
Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15, 1091–1114.
Lee, E., Lavieri, M. S. and Volk, M. (2019). Optimal screening for hepatocellular carcinoma: a restless bandit model. Manuf. Serv. Oper. Manag. 21, 198–212.
Mahajan, A. and Teneketzis, D. (2008). Multi-armed bandit problems. In Foundations and Applications of Sensor Management, ed. A. O. Hero et al., pp. 121–151. Springer, Boston.
Niño-Mora, J. (2007). A $(2/3)n^3$ fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J. Computing 19, 596–606.
Nouiehed, M. and Ross, S. M. (2018). A conjecture on the Feldman bandit problem. J. Appl. Prob. 55, 318–324.
Rodman, L. (1978). On the many-armed bandit problem. Ann. Prob. 6, 491–498.
Rusmevichientong, P., Shen, Z.-J. M. and Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operat. Res. 58, 1666–1680.
Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Appl. Stoch. Models Business Industry 26, 639–658.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285.
Washburn, R. B. (2008). Application of multi-armed bandits to sensor management. In Foundations and Applications of Sensor Management, ed. A. O. Hero et al., pp. 153–175. Springer, Boston.
Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits. J. Appl. Prob. 27, 637–648.
Whittle, P. (1988). Restless bandits: activity allocation in a changing world. J. Appl. Prob. 25, 287–298.