4 - Extended string matching
Published online by Cambridge University Press: 18 December 2014
Summary
Basic concepts
Up to now we have considered search patterns that are sequences of characters. However, in many cases one may be interested in a more sophisticated form of searching. The most complex patterns that we consider in this book are regular expressions, which are covered in Chapter 5. Regular expression searching, though, is costly in processing time and complex to program, so one should resort to it only if necessary. In many cases one needs far less flexibility, and the search problem can be solved more efficiently with much simpler algorithms.
We have designed this chapter on “extended strings” as a middle point between simple strings and regular expressions. We provide simple search algorithms for a number of enhancements over the basic string search, which can be solved more easily than general regular expressions. We focus on those used in text searching and computational biology applications.
We consider four extensions to the string search problem: classes of characters, bounded-length gaps, optional characters, and repeatable characters. The first one allows specifying sets of characters at any pattern or text position. The second permits searching for patterns containing bounded-length gaps, which is of interest for protein searching (e.g., PROSITE patterns [Gus97, HBFB99]). The third allows certain characters to appear optionally in a pattern occurrence, and the last permits a given character to appear multiple times in an occurrence, which includes wild cards. Finally, we consider some limited multipattern search capabilities.
Different occurrences of a pattern may have different lengths, and there may be several occurrences starting or ending at the same text position. Among the several choices for reporting these occurrences, we choose to report all the initial or all the final occurrence positions, depending on what is more natural for each algorithm.
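To make the reporting choice concrete, the following is a minimal brute-force sketch, not one of the chapter's algorithms. It assumes a hypothetical two-character pattern with a bounded-length gap between its characters; the function name `gap_occurrences` and its parameters are illustrative only. It lists every occurrence as a (start, end) pair and then collapses the pairs into distinct initial or distinct final positions.

```python
def gap_occurrences(text, first, last, min_gap, max_gap):
    """Return all (start, end) index pairs where `first` is followed by
    `last` after between min_gap and max_gap arbitrary characters.

    Hypothetical helper for illustration; quadratic-time brute force.
    """
    pairs = []
    for start, ch in enumerate(text):
        if ch != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            if end < len(text) and text[end] == last:
                pairs.append((start, end))
    return pairs


# Pattern: 'a', then 0 to 3 arbitrary characters, then 'b'.
pairs = gap_occurrences("aabxb", "a", "b", 0, 3)

# Several occurrences share a start or an end, so reporting only the
# distinct initial (or final) positions gives a shorter answer.
starts = sorted({s for s, _ in pairs})
ends = sorted({e for _, e in pairs})
```

Here four occurrences of different lengths reduce to two initial positions (0 and 1) or two final positions (2 and 4), which is the kind of output the text chooses to report.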
In this chapter we make heavy use of bit-parallel algorithms. With some extra work, other algorithms can be adapted to handle some extended patterns as well, but bit-parallel algorithms provide the maximum flexibility and in general the best performance.
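As an illustration of why bit-parallelism adapts easily to extended patterns, here is a minimal sketch of the classical Shift-And idea applied to classes of characters. It is written under assumptions of our own: the pattern is given as a list with one string of allowed characters per position, the function name `shift_and_classes` is hypothetical, and final occurrence positions are reported. It is not claimed to be the exact formulation used in the chapter.

```python
def shift_and_classes(pattern, text):
    """Report the end positions of all occurrences of `pattern` in `text`.

    `pattern` is a list of character classes (any iterable of characters),
    one per pattern position. Hypothetical interface for illustration.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # masks[c] has bit i set when character c is allowed at position i.
    # Supporting classes only changes this preprocessing step.
    masks = {}
    for i, cls in enumerate(pattern):
        for c in cls:
            masks[c] = masks.get(c, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit marking a match of the whole pattern
    state = 0              # bit i set: pattern[0..i] matches a suffix of the text read
    ends = []
    for j, c in enumerate(text):
        # Extend every active prefix by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends


# Pattern: 'a', then 'b' or 'c', then 'd'.
found = shift_and_classes(["a", "bc", "d"], "abdacdxabd")
```

In this sketch the search loop is identical to the one for plain strings; the classes are absorbed entirely by the mask table, which is one reason bit-parallel algorithms are flexible in this setting.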
- Type: Chapter
- Information: Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, pp. 77-98
- Publisher: Cambridge University Press
- Print publication year: 2002