Fast String Matching in Stationary Ergodic Sources

Published online by Cambridge University Press:  12 September 2008

John Shawe-Taylor
Affiliation:
Department of Computer Science, Royal Holloway and Bedford New College, University of London, Egham, Surrey TW20 0EX, UK e-mail: john@dcs.rhbnc.ac.uk

Abstract

A connection is made between the theory of ergodicity and the expected complexity of string searching. In particular, a substring search algorithm is introduced which, when applied to searching in text that has been produced by an appropriate stationary ergodic source, has an expected running time of O((N/m + m) log m), for a text string of length N and search string of length m. Similar expected complexity results have been obtained before, but the analysis is performed in a significantly more general framework, which models with greater accuracy the statistics of many types of strings, including natural language. The analysis also sheds light on the performance of the Boyer-Moore algorithm and the Sunday algorithm when applied to natural language.
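
For orientation, the following is a minimal sketch of the shift heuristic usually associated with the Sunday algorithm mentioned above (often called Quick Search). It is not the algorithm introduced in this article, whose details the abstract does not give. The function name `sunday_search` and the dictionary-based shift table are illustrative assumptions.

```python
def sunday_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Illustrative sketch of a Sunday-style (Quick Search) shift rule;
    an assumption for exposition, not the article's own algorithm.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Shift for a character = distance from its rightmost occurrence in the
    # pattern to one position past the pattern's end. Later occurrences
    # overwrite earlier ones, so the rightmost occurrence wins.
    shift = {ch: m - i for i, ch in enumerate(pattern)}

    matches = []
    pos = 0
    while pos <= n - m:
        # Compare the current window with the pattern.
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        nxt = pos + m  # index of the character just past the window
        if nxt >= n:
            break
        # Characters absent from the pattern allow a jump of m + 1.
        pos += shift.get(text[nxt], m + 1)
    return matches
```

When the character just past the window does not occur in the search string, the window advances by m + 1 positions, so on favourable text only about N/m windows are examined. This is the kind of behaviour that an N/m term in an expected running time reflects.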

Type
Research Article
Copyright
Copyright © Cambridge University Press 1996
