Hostname: page-component-7c8c6479df-ws8qp Total loading time: 0 Render date: 2024-03-28T09:04:28.306Z Has data issue: false hasContentIssue false

An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits

Published online by Cambridge University Press:  03 September 2019

Gabriel Zayas-Cabán*
Affiliation:
University of Wisconsin-Madison
Stefanus Jasin*
Affiliation:
University of Michigan
Guihua Wang*
Affiliation:
University of Michigan
*
*Postal address: Mechanical Engineering Building, University of Wisconsin-Madison, 1513 University Avenue, Room 3011 Madison, WI 53706-1691, USA. Email address: zayascaban@wisc.edu
**Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.
**Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.

Abstract

We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC) for a restless multi-armed bandit problem with discrete-time and finite states. It is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated at each time period; indeed, our analysis can also be applied in the setting with nonstationary transition matrix and nonstationary cost function. The asymptotic setting is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions of the underlying Markov chain (e.g. unichain) or fluid approximation (e.g. global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, where bandits can randomly arrive and depart the system.

Type
Original Article
Copyright
© Applied Probability Trust 2019 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Ahmad, S. H. A. et al. (2009). Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55, 40404050.CrossRefGoogle Scholar
Altman, E. (1999). Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca Raton, FL.Google Scholar
Ayer, T. et al. (2016). Prioritizing hepatitis C treatment in U.S. prisons. Preprint. Available at SSRN: https://ssrn.com/abstract=2869158.Google Scholar
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London.CrossRefGoogle Scholar
Bertsimas, D. and Niño-Mora, J. (2000). Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operat. Res. 48, 8090.CrossRefGoogle Scholar
Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 10601074.CrossRefGoogle Scholar
Caro, F. and Gallien, J. (2007). Dynamic assortment with demand learning for seasonal consumer goods. Manag. Sci. 53, 276292.CrossRefGoogle Scholar
Cohen, K., Zhao, Q. and Scaglione, A. (2014). Restless multi-armed bandits under time-varying activation constraints for dynamic spectrum access. In 2014 48th Asilomar Conference on Signals, Systems and Computers, IEEE, pp. 1575–1578.CrossRefGoogle Scholar
Deo, S. et al. (2013). Improving health outcomes through better capacity allocation in a community-based chronic care model. Operat. Res. 61, 1277–1294.CrossRefGoogle Scholar
Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (Budapest, 1972; Colloq. Math. Soc. János Bolyai 9), North-Holland, Amsterdam, pp. 241–266.Google Scholar
Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices, 2nd edn. John Wiley, Chichester.CrossRefGoogle Scholar
Hu, W. and Frazier, P. (2017). An asymptotically optimal index policy for finite-horizon restless bandits. Preprint. Available at https://arxiv.org/abs/1707.00205v1.Google Scholar
Kelly, F. P. (1981). Multi-armed bandits with discount factor near one: the Bernoulli case. Ann. Statist. 9, 9871001.CrossRefGoogle Scholar
Le Ny, J., Dahleh, M. and Feron, E. (2008). A linear programming relaxation and a heuristic for the restless bandit problem with general switching costs. Preprint. Available at https://arxiv.org/abs/0805.1563v1.Google Scholar
Lee, E., Lavieri, M. S. and Volk, M. (2018). Optimal screening for hepatocellular carcinoma: a restless bandit model. Manufacturing Service Operat. Manag. 21.Google Scholar
Mahajan, A. and Teneketzis, D. (2008). Multi-armed bandit problems. In Foundations Applications of Sensor Management, Springer, Boston, MA, pp. 121151.CrossRefGoogle Scholar
Nain, P. and Ross, K. W. (1986). Optimal priority assignment with hard constraint. IEEE Trans. Automatic Control 31, 883888.CrossRefGoogle Scholar
Niño-Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS J. Comput. 23, 254267.CrossRefGoogle Scholar
Papadimitriou, C. H. and Tsitsiklis, J. N. (1999). The complexity of optimal queuing network control. Math. Operat. Res. 24, 293305.CrossRefGoogle Scholar
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58, 527535.CrossRefGoogle Scholar
Schrijver, A. (2000). Theory of Linear and Integer Programming. John Wiley, New York.Google Scholar
Verloop, I. M. (2016). Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Prob. 26, 19471995.CrossRefGoogle Scholar
Washburn, R. B. (2008). Application of multi-armed bandits to sensor management. In Foundations and Applications of Sensor Management, Springer, Boston, MA, pp. 153175.CrossRefGoogle Scholar
Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits. J. Appl. Prob. 27, 637648.CrossRefGoogle Scholar
Whittle, P. (1980). Multi-armed bandits and the Gittins index. J. R. Statist. Soc. B 42, 143149.Google Scholar
Whittle, P. (1988). Restless bandits: activity allocation in a changing world. In A Celebration of Applied Probability (J. Appl. Prob. Spec. Vol. 25(A)), ed. Gani, J., Applied Probability Trust, Sheffield, pp. 287298.Google Scholar
Zayas-Cabán, G., Jasin, S. and Wang, G. (2019). An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Supplementary material. Available at https://doi.org/10.1017/apr.2019.29CrossRefGoogle Scholar
Supplementary material: PDF

Zayas-Cabán et al. supplementary material

Supplementary data

Download Zayas-Cabán et al. supplementary material(PDF)
PDF 163.6 KB