Repeated patterns and related phenomena in words are known to play a central role in many facets of computer science, telecommunications, coding, data compression, data mining, and molecular biology. One of the most fundamental questions arising in such studies is the frequency of pattern occurrences in a given string, known as the text. Applications include gene finding in biology, executing and analyzing tree-like protocols for multiaccess systems, discovering repeated strings in Lempel–Ziv schemes and other data compression algorithms, evaluating string complexity and its randomness, synchronization codes, user searching in wireless communications, and detecting the signatures of an attacker in intrusion detection.
The basic pattern matching problem is the following: given a fixed (or random) pattern w, or a set of patterns W, and a text X, determine how many times W occurs in X and how long it takes for W to occur in X for the first time. This basic setting is known as exact string matching, and it has many variations. In approximate string matching, better known as generalized string matching, certain words from W are expected to occur in the text while other words are forbidden and cannot appear in the text. In some applications, especially in constrained coding and neural data spikes, one puts restrictions on the text (e.g., only text without the patterns 000 and 0000 is permissible), leading to constrained string matching. Finally, in the most general case, patterns from the set W do not need to occur as strings (i.e., consecutively) but rather as subsequences; this leads to subsequence pattern matching, also known as hidden pattern matching.
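To make the quantities in these problem variants concrete, the following Python sketch computes them by brute force on a fixed text. It is only an illustration under stated assumptions: the function names are hypothetical, patterns are assumed to be nonempty strings, occurrences are counted with overlaps, and the "time" of the first occurrence is taken to be the number of text symbols read when a pattern is first completed.

```python
def count_occurrences(text, patterns):
    """Exact matching: count occurrences (overlaps included) of any pattern in text."""
    return sum(
        1
        for i in range(len(text))
        for w in patterns
        if w and text.startswith(w, i)  # empty patterns are ignored (assumption)
    )


def first_occurrence_time(text, patterns):
    """Number of symbols read when some pattern is first completed, or None."""
    ends = [text.find(w) + len(w) for w in patterns if w and w in text]
    return min(ends) if ends else None


def is_permissible(text, forbidden):
    """Constrained matching: a text is permissible if it has no forbidden pattern."""
    return count_occurrences(text, forbidden) == 0


def count_subsequence_occurrences(text, w):
    """Hidden matching: count the ways w occurs in text as a subsequence."""
    # ways[j] = number of ways to embed the prefix w[:j] in the text read so far
    ways = [1] + [0] * len(w)
    for symbol in text:
        # go right to left so each text symbol is used at most once per embedding
        for j in range(len(w), 0, -1):
            if w[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(w)]
```

For example, in the text abababa the pattern aba occurs three times as a string (at overlapping positions) and is first completed after three symbols, while in the text abab the pattern ab occurs three times as a subsequence.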
These various pattern matching problems find a myriad of applications. Molecular biology provides an important source of applications of pattern matching, be it exact, approximate, or subsequence pattern matching. Examples abound: finding signals in DNA; finding split genes where exons are interrupted by introns; searching for starting and stopping signals in genes; and finding tandem repeats in DNA.