Hostname: page-component-8448b6f56d-jr42d Total loading time: 0 Render date: 2024-04-23T21:35:59.442Z Has data issue: false hasContentIssue false

Exact distribution of word counts in shuffled sequences

Published online by Cambridge University Press:  01 July 2016

Einar Andreas Rødland*
Affiliation:
Rikshospitalet–Radiumhospitalet HF, University of Oslo
*
Postal address: Institute of Medical Microbiology, Centre for Molecular Biology and Neuroscience, University of Oslo, Rikshospitalet–Radiumhospitalet HF, N-0027 Oslo, Norway. Email address: e.a.rodland@medisin.uio.no
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These may often be identified by frequent-word analyses as being particularly abundant. Accurate statistics is needed to assess the statistical significance of these word frequencies. The set of shuffled sequences - letter sequences having the same k-word composition, for some choice of k, as the sequence being analysed - is considered the most appropriate sample space for analysing word counts. However, little is known about these word counts. Here we present exact formulae for word counts in shuffled sequences.

Type
General Applied Probability
Copyright
Copyright © Applied Probability Trust 2006 

References

Altschul, S. and Erickson, B. (1985). Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Molec. Biol. Evol. 2, 526538.Google Scholar
Cowan, R. (1991). Expected frequencies of DNA patterns using Whittle's formula. J. Appl. Prob. 28, 886892.Google Scholar
Dawson, R. and Good, I. (1957). Exact Markov probabilities from oriented linear graphs. Ann. Math. Statist. 28, 946956.Google Scholar
Fitch, W. (1983). Random sequences. J. Molec. Biol. 163, 171176.Google Scholar
Goodman, L. (1958). Exact probabilities and asymptotic relationships for some statistics from m-th order Markov chains. Ann. Math. Statist. 29, 476490.Google Scholar
Kandel, D., Matias, Y., Unger, R. and Winkler, P. (1996). Shuffling biological sequences. Discrete Appl. Math. 71, 171185.CrossRefGoogle Scholar
Prum, B., Rodolphe, F. and de Turckheim, E. (1995). Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205220.Google Scholar
Robin, S. and Daudin, J. (1999). Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob. 36, 179193.CrossRefGoogle Scholar
Robin, S. and Schbath, S. (2001). Numerical comparison of several approximations of the word count distribution in random sequences. J. Comput. Biol. 8, 349359.Google Scholar
Schbath, S. (1995). Compound Poisson approximation of word counts in DNA sequences. ESAIM Prob. Statist. 1, 116.Google Scholar
Schbath, S. (1995). Étude asymptotique du nombre d'occurrences d'un mot dans une chaıcirc;ne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN. , Université René Descartes, Paris V.Google Scholar
Tutte, W. (1948). The dissection of equilateral triangles into equilateral triangles. Proc. Camb. Philos. Soc. 44, 463482.Google Scholar
Van Aardenne-Ehrenfest, T. and de Bruijn, N. (1951). Circuits and trees in oriented linear graphs. Simon Stevin 28, 203217.Google Scholar
Whittle, P. (1955). Some distribution and moment formulae for the Markov chain. J. R. Statist. Soc. B 17, 235242.Google Scholar
Zaman, A. (1984). Urn models for Markov exchangeability. Ann. Prob. 12, 223229.Google Scholar