
Distribution of the number of words with a prescribed frequency and tests of randomness

Published online by Cambridge University Press:  01 July 2016

Andrew L. Rukhin*
Affiliation:
University of Maryland, Baltimore County
* Postal address: Department of Mathematics and Statistics, UMBC, Baltimore, MD 21250, USA. Email address: rukhin@math.umbc.edu

Abstract

This paper investigates the properties of statistical procedures based on the numbers of different patterns, using generating functions for the probabilities of a prescribed number of occurrences of given patterns in a random text. Asymptotic formulae are derived for the expected value of the number of words occurring a given number of times and for the covariance matrix. The form of the optimal linear test based on these statistics is established. These problems arise in testing the randomness of a binary string, DNA sequencing, source coding, synchronization, quality control protocols, and related areas. Indeed, the probabilities of repeated (overlapping) patterns are important in information theory (the second-order properties of relative frequencies) and in molecular biology (finding patterns with unexpectedly low or high frequencies).
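As a rough illustration of the statistic the abstract refers to, the sketch below counts, for a given text, how many distinct words of length m occur exactly r times when occurrences are allowed to overlap. The function name `words_by_frequency`, its parameters, and the sliding-window convention are assumptions made for this example and are not taken from the paper; the sketch only tallies the counts and does not implement the paper's asymptotic formulae or its optimal linear test.

```python
from collections import Counter


def words_by_frequency(text, m, alphabet="01"):
    """Map r -> number of distinct m-letter words occurring exactly r times.

    Occurrences are counted with overlaps, using every window of length m.
    Assumes every character of `text` belongs to `alphabet`.
    """
    # Count each m-letter word over all n - m + 1 overlapping windows.
    occurrences = Counter(text[i:i + m] for i in range(len(text) - m + 1))
    # Tally how many distinct words share each occurrence count.
    tally = Counter(occurrences.values())
    # Words of the alphabet that never appear have frequency 0.
    tally[0] = len(alphabet) ** m - len(occurrences)
    return dict(tally)


if __name__ == "__main__":
    # Windows of "0110100" for m = 2: 01, 11, 10, 01, 10, 00.
    print(words_by_frequency("0110100", 2))
```

For a long pseudorandom bit string, comparing these tallies with their expected values under the randomness hypothesis is the kind of test the abstract has in mind; the expected values and covariances themselves would have to come from the paper.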

Type
General Applied Probability
Copyright
Copyright © Applied Probability Trust 2002 

