An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits

  • Gabriel Zayas-Cabán (a1), Stefanus Jasin (a2) and Guihua Wang (a2)


We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC), for a restless multi-armed bandit problem in discrete time with finitely many states. The heuristic is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and a time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated in each time period; indeed, our analysis also applies when the transition matrix and the cost function are nonstationary. The asymptotic regime is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is to show that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions on the underlying Markov chain (e.g. unichain) or on the fluid approximation (e.g. a global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, in which bandits can randomly arrive at and depart from the system.
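
To make the phrase "linear programming relaxation" concrete, the block below sketches one standard occupation-measure relaxation for a finite-horizon, multi-action restless bandit. All notation is assumed for illustration and is not taken from the paper: $x_t(s,a)$ is the expected fraction of bandits in state $s$ that receive action $a$ at time $t$, $c_t(s,a)$ is the per-bandit cost, $p_t(s' \mid s,a)$ is the transition probability, $\mu_1$ is the initial state distribution, action $a = 0$ is taken to be the passive action, and $\beta_t$ is the activation budget at time $t$ expressed as a fraction of the population.

```latex
\begin{align*}
\min_{x \ge 0} \quad
  & \sum_{t=1}^{T} \sum_{s} \sum_{a} c_t(s,a)\, x_t(s,a) \\
\text{s.t.} \quad
  & \sum_{a} x_1(s,a) = \mu_1(s)
      && \text{for all } s, \\
  & \sum_{a} x_{t+1}(s',a) = \sum_{s} \sum_{a} p_t(s' \mid s,a)\, x_t(s,a)
      && \text{for all } s',\ t = 1,\dots,T-1, \\
  & \sum_{s} \sum_{a \neq 0} x_t(s,a) \le \beta_t
      && \text{for all } t = 1,\dots,T.
\end{align*}
```

The relaxation replaces the per-period activation limit, which must hold on every sample path, with a limit that holds only in expectation. A generic way to turn an optimal solution $x^*$ into a randomized rule is to give a bandit in state $s$ at time $t$ action $a$ with probability $x_t^*(s,a) / \sum_{a'} x_t^*(s,a')$ whenever the denominator is positive. This is offered only as an illustration of the general idea and is not necessarily how RAC is defined in the paper.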


Corresponding author

*Postal address: Mechanical Engineering Building, University of Wisconsin-Madison, 1513 University Avenue, Room 3011, Madison, WI 53706-1691, USA. Email address:
**Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.




