Compound Poisson approximation of word counts in DNA sequences

Sophie Schbath

doi:10.1051/ps:1997100

Abstract

Identifying words with unexpected frequencies is an important problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a word W. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approximations for two different counts. We approximate the “declumped” count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions.
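To make the two counts concrete, here is a minimal sketch in Python. It assumes a clump is a maximal run of overlapping occurrences of W, so the declumped count is the number of clumps and N(W) is the sum of the clump sizes. The function names and the toy sequence are illustrative only and are not taken from the article.

```python
def occurrence_positions(seq, word):
    """Return the start indices of all occurrences of word in seq, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping occurrences are found too.
        start = seq.find(word, start + 1)
    return positions


def clump_sizes(seq, word):
    """Group occurrences into clumps and return the size of each clump.

    Assumption: two successive occurrences belong to the same clump when
    they overlap, i.e. their start positions differ by less than len(word).
    """
    sizes = []
    prev = None
    for pos in occurrence_positions(seq, word):
        if prev is not None and pos - prev < len(word):
            sizes[-1] += 1      # overlaps the previous occurrence: same clump
        else:
            sizes.append(1)     # no overlap: a new clump starts here
        prev = pos
    return sizes


# Toy example (hypothetical sequence, chosen for illustration).
seq = "ACGATATATCGATATGGATA"
word = "ATA"
sizes = clump_sizes(seq, word)
n_clumps = len(sizes)        # the "declumped" count
n_occurrences = sum(sizes)   # N(W)
print(sizes, n_clumps, n_occurrences)
```

In this toy sequence the word ATA occurs four times but forms only three clumps, because the first two occurrences overlap. Writing N(W) as a sum of clump sizes over a number of clumps is the structure that motivates a compound Poisson approximation for N(W) and a plain Poisson approximation for the declumped count.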
