Asymptotic Behavior of k-Word Matches Between two Uniformly Distributed Sequences

  • M. R. Kantorovitz (a1), H. S. Booth (a2), C. J. Burden (a2) and S. R. Wilson (a2)

Abstract

Given two sequences of length n over a finite alphabet A of size |A| = d, the $D_2$ statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of $D_2$ when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the $D_2$ statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the $D_2$ statistic is approximately normal as n gets large. We also give a formula for the variance of $D_2$ in the uniform case.
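To make the definition above concrete, the following is a minimal Python sketch of the word-match count, not code from the paper. It assumes that overlapping words are counted and that every pair of positions (i, j) with identical k-letter words contributes one match. The function names `d2_statistic` and `expected_d2_uniform` are hypothetical. The mean helper relies only on the fact that, under independent uniform letters, a given pair of k-words matches with probability d^(-k); it does not reproduce the paper's variance formula.

```python
from collections import Counter


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Count pairs (i, j) such that seq_a[i:i+k] == seq_b[j:j+k].

    Assumption: overlapping words count, and every matching pair of
    positions contributes one match.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Tally each k-letter word in both sequences.
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A word seen a times in seq_a and b times in seq_b yields a * b matches.
    return sum(count * words_b[word] for word, count in words_a.items())


def expected_d2_uniform(n: int, k: int, d: int) -> float:
    """Mean of the statistic for two i.i.d. uniform sequences of length n.

    There are (n - k + 1)**2 pairs of positions, and each pair matches
    with probability d**(-k) when letters are uniform on d symbols.
    """
    return (n - k + 1) ** 2 / d ** k


if __name__ == "__main__":
    print(d2_statistic("ACGTAC", "TACGTT", 2))
    print(expected_d2_uniform(n=6, k=2, d=4))
```

Counting words first and then multiplying the tallies avoids comparing all pairs of positions directly, which matters once n is large.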

Corresponding author

Postal address: Department of Mathematics, University of Illinois, Urbana, IL 61801, USA. Email address: ruth@math.uiuc.edu
∗∗ H. S. Booth died 26 May 2005.
∗∗∗ Postal address: Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.
∗∗∗∗ Email address: conrad.burden@anu.edu.au
∗∗∗∗∗ Email address: sue.wilson@anu.edu.au

References

[1] Barbour, A. and Chryssaphinou, O. (2001). Compound Poisson approximation: a user's guide. Ann. Appl. Prob. 11, 964–1002.
[2] Billingsley, P. (1995). Probability and Measure, 3rd edn. John Wiley, New York.
[3] Burke, J., Davison, D. and Hide, W. (1999). d2 cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9, 1135–1142.
[4] Carpenter, J. E., Christoffels, A., Weinbach, Y. and Hide, W. A. (2002). Assessment of the parallelization approach of d2 cluster for high-performance sequence clustering. J. Comput. Chem. 23, 755–757.
[5] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Prob. 3, 534–545.
[6] Christoffels, A. et al. (2001). STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29, 234–238.
[7] Dembo, A. and Rinott, Y. (1996). Some examples of normal approximations by Stein's method. In Random Discrete Structures (IMA Vol. Math. Appl. 76), Springer, New York, pp. 25–44.
[8] Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics. Continuous Univariate Distributions. 1. Houghton Mifflin Co., Boston, MA.
[9] Lippert, R. A., Huang, H. and Waterman, M. S. (2002). Distributional regimes for the number of k-word matches between two random sequences. Proc. Nat. Acad. Sci. USA 99, 13980–13989.
[10] Miller, R. T. et al. (1999). A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 9, 1143–1155.
[11] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Molec. Biol. 147, 195–197.
[12] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. Sixth Berkeley Symp. Math. Statist. Prob., Vol. II, University of California Press, Berkeley, pp. 583–602.
[13] Stein, C. (1986). Approximate Computation of Expectations. Institute of Mathematical Statistics, Hayward, CA.
[14] Vinga, S. and Almeida, J. S. (2003). Alignment-free sequence comparison – a review. Bioinformatics 19, 513–523.
[15] Waterman, M. S. (1995). Introduction to Computational Biology. Chapman & Hall, New York.
[16] Zhang, Y. X. et al. (2002). Genome shuffling leads to rapid phenotypic improvement in bacteria. Nature 415, 644–646.
