
Optimal learning with non-Gaussian rewards

Published online by Cambridge University Press:  24 March 2016

Zi Ding
Affiliation: University of Maryland

Ilya O. Ryzhov
Affiliation: University of Maryland

Postal address: Robert H. Smith School of Business, University of Maryland, 4322 Van Munching Hall, College Park, MD 20742, USA.

Abstract

We propose a novel theoretical characterization of the optimal 'Gittins index' policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process which probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach enables an easy connection to the convenient time-change properties of a Brownian motion. Although no such device is available in general for the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We provide the free-boundary PIDE in explicit form under the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value for a given belief state.
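By way of illustration only (this is not the PIDE method developed in the paper), the Gittins index in discrete time can be approximated through its classical calibration, or retirement-option, definition: the index of a belief state is the constant per-period retirement reward at which an optimal decision maker is indifferent between retiring and continuing to sample the arm. The sketch below specializes this to one of the paper's two settings, Poisson rewards with a conjugate gamma prior; the function names, discount factor, and horizon/support truncations are all illustrative assumptions.

```python
# Hypothetical sketch: Gittins index of a single Poisson-reward arm with a
# conjugate Gamma(a0, b0) prior, via the calibration (retirement-option)
# definition. Truncations are illustrative, not the paper's discretization.
import math

def gittins_index_poisson(a0, b0, beta=0.9, depth=20, k_max=25, tol=1e-4):
    def predictive_pmf(k, a, b):
        # Posterior predictive of a Poisson count under Gamma(shape=a, rate=b):
        # negative binomial with r = a and success probability b / (b + 1).
        return math.exp(
            math.lgamma(k + a) - math.lgamma(a) - math.lgamma(k + 1)
            + a * math.log(b / (b + 1.0)) - k * math.log(b + 1.0)
        )

    def retirement_value(lam):
        # Optimal value of the one-arm retirement problem, by memoized
        # recursion on (total observed count n, number of samples taken s).
        memo = {}

        def v(n, s):
            a, b = a0 + n, b0 + s
            retire = lam / (1.0 - beta)
            if s == depth:
                # Tail approximation: play myopically forever past the horizon.
                return max(retire, (a / b) / (1.0 - beta))
            if (n, s) not in memo:
                # Predictive mass beyond k_max is dropped (slight underestimate).
                cont = a / b + beta * sum(
                    predictive_pmf(k, a, b) * v(n + k, s + 1)
                    for k in range(k_max + 1)
                )
                memo[(n, s)] = max(retire, cont)
            return memo[(n, s)]

        return v(0, 0)

    # Bracket the index: it is at least the myopic mean a0/b0; grow the upper
    # end until immediate retirement is optimal, then bisect on lambda.
    lo = a0 / b0
    hi = lo + 1.0
    while retirement_value(hi) > hi / (1.0 - beta) + 1e-9:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if retirement_value(mid) > mid / (1.0 - beta) + 1e-9:
            lo = mid  # continuing still beats retiring: index exceeds mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Example: Gamma(1, 1) prior (myopic mean 1); the index exceeds the
    # myopic mean because sampling also buys information about the arm.
    print(gittins_index_poisson(1.0, 1.0))
```

The bisection exploits the standard fact that the continuation value crosses the retirement value exactly once as the retirement reward increases. The horizon and support truncations trade accuracy for tractability; the paper's free-boundary PIDE characterization, by contrast, works with the exact continuous-time interpolation of the reward sequence.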

Type: Research Article

Copyright © Applied Probability Trust 2016

