EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET

Wesley Cowan; Michael N. Katehakis

doi:10.1017/S0269964818000529


Part of: Nonparametric inference

Published online by Cambridge University Press: 26 January 2019


Wesley Cowan: Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA. E-mail: cwcowan@math.rutgers.edu
Michael N. Katehakis: Department of Management Science and Information Systems, Rutgers University, Piscataway, NJ 08854, USA. E-mail: mnk@rutgers.edu


Abstract

The purpose of this paper is to provide further insight into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample regret associated with two simple classes of allocation policies. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the “exploration” side of the classical “exploration/exploitation” tradeoff.
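How g can act as an exploration budget may be illustrated with a minimal Python sketch of a forcing-style rule. This is not the authors' construction, which is not reproduced on this page. The rule assumed here is: at round n, if the least-sampled arm has fewer than g(n) pulls, sample it (a forced action); otherwise sample the arm with the highest sample mean. The names `forcing_policy` and `bernoulli`, and the choice g(n) = log(n + 1), are hypothetical and chosen only for illustration.

```python
import math
import random


def bernoulli(p):
    """Hypothetical helper: an arm paying 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def forcing_policy(arms, horizon, g, rng):
    """Assumed forcing-style rule (illustrative, not the paper's definition).

    arms    : list of callables, each taking an rng and returning a reward
    horizon : number of rounds to play
    g       : slowly increasing function; g(n) is the exploration budget
    Returns the number of times each arm was sampled.
    """
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    for n in range(1, horizon + 1):
        least = min(range(k), key=lambda i: counts[i])
        if counts[least] == 0 or counts[least] < g(n):
            # Exploration: the least-sampled arm is below its budget g(n).
            i = least
        else:
            # Exploitation: play the arm with the highest sample mean.
            i = max(range(k), key=lambda j: sums[j] / counts[j])
        sums[i] += arms[i](rng)
        counts[i] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    pulls = forcing_policy(
        [bernoulli(0.2), bernoulli(0.8)],
        horizon=5000,
        g=lambda n: math.log(n + 1),
        rng=rng,
    )
    print(pulls)
```

Under this assumed rule, every arm is sampled at least on the order of g(n) times by round n, so a slower-growing g means fewer forced pulls of inferior arms and a longer wait before early estimation errors are corrected.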

Keywords

bandits; forcing actions; inflated sample means; multi-armed; online learning; sequential allocation; upper confidence bounds

MSC classification

Primary: 62G05: Estimation

Secondary: 62G20: Asymptotic properties

Type: Research Article
Information: Probability in the Engineering and Informational Sciences, Volume 34, Special Issue 3: Analytic Methods in Health Care and in Clinical Trials In Honor and Memory of Lycourgos (Lee) Papayanopoulos, July 2020, pp. 406–428

DOI: https://doi.org/10.1017/S0269964818000529
Copyright: Copyright © Cambridge University Press 2019


