An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits

Gabriel Zayas-Cabán; Stefanus Jasin; Guihua Wang

doi:10.1017/apr.2019.29

Part of: Operations research and management science; Computer system organization

Published online by Cambridge University Press: 03 September 2019

Gabriel Zayas-Cabán*
Affiliation: University of Wisconsin-Madison
Stefanus Jasin**
Affiliation: University of Michigan
Guihua Wang**
Affiliation: University of Michigan
* Postal address: Mechanical Engineering Building, University of Wisconsin-Madison, 1513 University Avenue, Room 3011, Madison, WI 53706-1691, USA. Email address: zayascaban@wisc.edu
** Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.

Abstract

We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC), for a restless multi-armed bandit problem in discrete time with finitely many states. It is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and a time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated in each time period; our analysis also applies to settings with a nonstationary transition matrix and a nonstationary cost function. The asymptotic setting is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions of the underlying Markov chain (e.g. unichain) or fluid approximation (e.g. global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, where bandits can randomly arrive at and depart from the system.
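
To make the phrase "linear programming relaxation" concrete, the following is a generic occupation-measure relaxation of a finite-horizon restless bandit problem with a time-dependent activation budget. It is a sketch under stated assumptions and not necessarily the exact formulation used in the article. All notation here is hypothetical: $x_t(s,a)$ is the expected number of bandits in state $s$ taking action $a$ at time $t$, $c_t(s,a)$ is the cost, $P_t(s' \mid s,a)$ is the transition probability, $n_1(s)$ is the initial number of bandits in state $s$, $b_t$ is the activation budget, and action $a = 0$ is assumed to be the passive action.

```latex
\begin{align*}
\min_{x \ge 0} \quad
  & \sum_{t=1}^{T} \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}}
    c_t(s,a)\, x_t(s,a) \\
\text{s.t.} \quad
  & \sum_{a \in \mathcal{A}} x_1(s,a) = n_1(s)
    && \forall\, s \in \mathcal{S}, \\
  & \sum_{a \in \mathcal{A}} x_{t+1}(s',a)
    = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}}
      P_t(s' \mid s,a)\, x_t(s,a)
    && \forall\, s' \in \mathcal{S},\ t = 1, \dots, T-1, \\
  & \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A} \setminus \{0\}}
    x_t(s,a) \le b_t
    && t = 1, \dots, T.
\end{align*}
```

The relaxation replaces the per-sample-path budget constraint of the stochastic control problem with a constraint that holds only in expectation, so its optimal value bounds the optimal cost from below. A heuristic built on such a relaxation would typically use the optimal $x_t(s,a)$ to decide what fraction of the bandits in each state receives each action.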

Keywords

Restless bandit; asymptotic optimality; finite horizon; nonstationary; arm acquiring; nonindexable bandit

MSC classification

Primary: 90C40: Markov and semi-Markov decision processes

Secondary: 68M20: Performance evaluation; queueing; scheduling
Secondary: 90B36: Scheduling theory, stochastic

Type: Original Article
Information: Advances in Applied Probability, Volume 51, Issue 3, September 2019, pp. 745-772

DOI: https://doi.org/10.1017/apr.2019.29
Copyright: © Applied Probability Trust 2019

A correction has been issued for this article:

An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits: Corrigendum

Gabriel Zayas-Cabán, Jiaxin Liang, Stefanus Jasin and Guihua Wang

Advances in Applied Probability, Volume 55, Issue 3
