
Some reward–penalty rules for the multi-armed bandit problem which are asymptotically optimal

Published online by Cambridge University Press: 01 July 2016

K. D. Glazebrook, University of Newcastle upon Tyne
Postal address: Department of Statistics, The University, Newcastle upon Tyne, NE1 7RU, U.K.

Abstract


In the mathematical learning literature, reward–penalty rules have been studied in various decision-theoretic and game-theoretic contexts, including the multi-armed bandit problem. Here we propose an elaboration of Bather's randomised allocation indices that yields rules for the multi-armed bandit problem which are both reward–penalty and asymptotically optimal.
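The Python sketch below is only illustrative and is not the rule proposed in this letter. It assumes a two-armed Bernoulli bandit in which arm i, after s_i successes in n_i trials, receives a randomised index m_i + λ(n_i)Z_i, where Z_1 and Z_2 are independent standard exponential variables. This perturbed-estimate form is in the spirit of Bather's randomised allocation indices, but the point estimate m_i = (s_i + 1)/(n_i + 2) and the scale λ(n) = 1/√(n + 1) are hypothetical choices. The sketch also reads "reward–penalty" informally: a success on an arm should not lower, and a failure should not raise, the probability that the arm is chosen next. All function names are hypothetical.

```python
import math


def perturbation_scale(n: int) -> float:
    # Hypothetical lambda(n): positive and decreasing to 0 as n grows (assumed form).
    return 1.0 / math.sqrt(n + 1)


def point_estimate(successes: int, trials: int) -> float:
    # Hypothetical point estimate: posterior mean under a Beta(1, 1) prior.
    return (successes + 1) / (trials + 2)


def prob_choose_first(s1: int, n1: int, s2: int, n2: int) -> float:
    """Exact P(arm 1's index exceeds arm 2's) for index_i = m_i + a_i * Z_i,
    with Z_1, Z_2 independent Exp(1) variables and a_i = lambda(n_i)."""
    m1, m2 = point_estimate(s1, n1), point_estimate(s2, n2)
    a, b = perturbation_scale(n1), perturbation_scale(n2)
    t = m2 - m1  # arm 1 is chosen iff a*Z1 - b*Z2 > t (ties have probability 0)
    if t >= 0:
        # P(a*Z1 > t + b*Z2) = exp(-t/a) * E[exp(-b*Z2/a)] = exp(-t/a) * a/(a+b)
        return a / (a + b) * math.exp(-t / a)
    # For t < 0 use the complement: P(b*Z2 >= a*Z1 + |t|) = b/(a+b) * exp(t/b)
    return 1.0 - b / (a + b) * math.exp(t / b)


def is_reward_penalty_at(s1: int, n1: int, s2: int, n2: int) -> bool:
    """Check, at one state only, that a success on arm 1 does not lower and a
    failure does not raise the probability of choosing arm 1 next."""
    p_now = prob_choose_first(s1, n1, s2, n2)
    p_success = prob_choose_first(s1 + 1, n1 + 1, s2, n2)
    p_failure = prob_choose_first(s1, n1 + 1, s2, n2)
    return p_success >= p_now >= p_failure


if __name__ == "__main__":
    # Arm 1 untried; arm 2 has 90 successes in 98 trials.
    print(prob_choose_first(0, 0, 90, 98))
    print(is_reward_penalty_at(0, 0, 90, 98))
```

Because the difference of two scaled exponential variables has a closed-form tail, the selection probability is computed exactly rather than by simulation. In this sketch a failure always lowers the probability of choosing arm 1, since it both reduces m_1 and shrinks λ(n_1). A success raises m_1 but also shrinks the perturbation, so the two effects pull in opposite directions; `is_reward_penalty_at` checks the property at a single state rather than proving it in general.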

Type: Letters to the Editor
Copyright © Applied Probability Trust 1983

References

Bather, J. (1980) Randomised allocation of treatments in sequential trials. Adv. Appl. Prob. 12, 174–182.
Bather, J. (1981) Randomised allocation of treatments in sequential experiments (with discussion). J. R. Statist. Soc. B43, 265–292.
Glazebrook, K. D. (1980) On randomized dynamic allocation indices for the sequential design of experiments. J. R. Statist. Soc. B42, 342–346.
Meybodi, M. R. and Lakshmivarahan, S. (1982) ε-optimality of a general class of learning algorithms. In Proc. Conf. Mathematical Learning Models–Theory and Applications. To appear.