Hostname: page-component-8448b6f56d-jr42d Total loading time: 0 Render date: 2024-04-19T21:55:21.548Z Has data issue: false hasContentIssue false

Error bounds on multivariate Normal approximations for word count statistics

Published online by Cambridge University Press:  01 July 2016

Haiyan Huang*
Affiliation:
University of Southern California
*
Current address: Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA. Email address: hhuang@hsph.harvard.edu

Abstract

Given a sequence S and a collection Ω of d words, it is of interest in many applications to characterize the multivariate distribution of the vector of counts U = (N(S,w1), …, N(S,wd)), where N(S,w) is the number of times a word w ∈ Ω appears in the sequence S. We obtain an explicit bound on the error made when approximating the multivariate distribution of U by the normal distribution, when the underlying sequence is i.i.d. or first-order stationary Markov over a finite alphabet. When the limiting covariance matrix of U is nonsingular, the error bounds decay at rate O((log n) / √n) in the i.i.d. case and O((log n)3 / √n) in the Markov case. In order for U to have a nondegenerate covariance matrix, it is necessary and sufficient that the counted word set Ω is not full, that is, that Ω is not the collection of all possible words of some length k over the given finite alphabet. To supply the bounds on the error, we use a version of Stein's method.

Type
General Applied Probability
Copyright
Copyright © Applied Probability Trust 2002 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Bhattacharya, R. N. and Ranga Rao, R. (1986). Normal Approximation and Asymptotic Expansions. Robert E. Krieger Publishing Co., Melbourne, FL (reprint of the 1976 original).Google Scholar
Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Math. 19, 521545.Google Scholar
Breen, S., Waterman, M. S. and Zhang, N. (1985). Renewal theory for several patterns. J. Appl. Prob. 22, 228234.CrossRefGoogle Scholar
Dembo, A. and Rinott, Y. (1994). Some examples of normal approximations by Stein's method. In Random Discrete Structures (IMA Vol. 76), Springer, New York, pp. 2544.Google Scholar
Goetze, B. (1991). On the rate of convergence in the multivariate CLT. Ann. Prob. 19, 724739.Google Scholar
Karlin, S. and Taylor, H. M. (1975). A First Course in Stochastic Processes, 2nd edn. Academic Press, New York.Google Scholar
Lundstrom, R. (1990). Stochastic models and statistical methods for DNA sequence data. , University of Southern California.Google Scholar
Prum, B., Rodolphe, F. and de Turckheim, È. (1995). Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205220.Google Scholar
Reinert, G. and Schbath, S. (1998). Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J. Comput. Biol. 5, 223253.CrossRefGoogle ScholarPubMed
Reinert, G., Schbath, S. and Waterman, M. S. (2000). Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 7, 146.CrossRefGoogle ScholarPubMed
Rinott, Y. and Rotar, V. (1996). A multivariate CLT for local dependence with n-1/2 log n rate and application to multivariate graph related statistics. J. Multivariate Anal. 56, 333350.CrossRefGoogle Scholar
Rinott, Y. and Rotar, V. (1997). On coupling constructions and rates in the CLT for dependent summands with applications to the antivector model and weighted U-statistics. Ann. Appl. Prob. 7, 10801105.CrossRefGoogle Scholar
Robin, S. and Schbath, S. (2001). Numerical comparison of several approximations of the word count distribution in random sequences. J. Comput. Biol. 8, 349359.CrossRefGoogle ScholarPubMed
Sazonov, V. V. (1968). On the multi-dimensional central limit theorem. Sankhyā A 30, 181204.Google Scholar
Tanushev, M. (1997). Joint central limit theorem for renewals of competing patterns. , University of Southern California.Google Scholar
Waterman, M. S. (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.CrossRefGoogle Scholar