On the occupancy problem for a regime-switching model

  • Michael Grabchak (a1), Mark Kelbert (a2) and Quentin Paris (a2)


This article studies the expected occupancy probabilities on an alphabet. Unlike the standard situation, where observations are assumed to be independent and identically distributed, we assume that they follow a regime-switching Markov chain. For this model, we (1) give finite sample bounds on the expected occupancy probabilities, and (2) provide detailed asymptotics in the case where the underlying distribution is regularly varying. We find that in the regularly varying case the finite sample bounds are rate optimal and have, up to a constant, the same rate of decay as the asymptotic result.
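As an informal illustration of the quantity described in the abstract, the following Monte Carlo sketch estimates the probability that the next observation is a letter seen exactly r times among the first n observations, when the letters are generated by a regime-switching chain. Everything specific here is an assumption made for illustration and is not taken from the article: the function names `sample_path` and `occupancy_probability`, the uniform initial regime, the representation of each regime as a dictionary of letter probabilities, and the toy parameters in the usage example.

```python
import random
from collections import Counter


def sample_path(n, transition, emissions, rng):
    """Draw n letters from a regime-switching chain (illustrative model).

    transition[i][j] is the probability of moving from regime i to regime j;
    emissions[i] maps each letter to its probability under regime i.
    """
    # Assumption: the initial regime is chosen uniformly at random.
    regime = rng.randrange(len(transition))
    letters = []
    for _ in range(n):
        dist = emissions[regime]
        # Emit a letter from the current regime's distribution.
        letters.append(rng.choices(list(dist), weights=list(dist.values()))[0])
        # Move the hidden regime one step along its Markov chain.
        regime = rng.choices(range(len(transition)), weights=transition[regime])[0]
    return letters


def occupancy_probability(n, r, transition, emissions, trials=2000, seed=0):
    """Monte Carlo estimate of P(observation n+1 was seen exactly r times
    among the first n observations)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        path = sample_path(n + 1, transition, emissions, rng)
        counts = Counter(path[:n])       # occupancy counts after n draws
        hits += counts[path[n]] == r     # unseen letters have count 0
    return hits / trials


if __name__ == "__main__":
    # Toy example: two regimes with different letter distributions.
    transition = [[0.9, 0.1], [0.2, 0.8]]
    emissions = [{"a": 0.7, "b": 0.3}, {"c": 0.5, "d": 0.25, "e": 0.25}]
    print(occupancy_probability(20, 0, transition, emissions))
```

With r = 0 the estimate corresponds to the expected missing mass, that is, the chance that the next letter has not been observed yet. The sketch only simulates the quantity; it does not reproduce the finite sample bounds or the asymptotics discussed in the article.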


Corresponding author

*Postal address: Department of Mathematics and Statistics, Charlotte, NC, USA.
**Postal address: National Research University Higher School of Economics (HSE), Faculty of Economics, Department of Statistics and Data Analysis, Moscow, Russia.
***Postal address: National Research University Higher School of Economics (HSE), Faculty of Computer Science, School of Data Analysis and Artificial Intelligence & HDI Lab, Moscow, Russia.



