
Efficient dictionary-based text rewriting using subsequential transducers

Published online by Cambridge University Press: 01 December 2007

S. MIHOV
Affiliation:
Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria e-mail: stoyan@lml.bas.bg
K. U. SCHULZ
Affiliation:
CIS, University of Munich, Munich, Germany e-mail: schulz@cis.uni-muenchen.de

Abstract

Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the “original”) and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means to replace in an input text occurrences of originals by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer that accepts any text t as input and outputs the intended rewriting result under the so-called “leftmost-longest match” replacement with skips, t'. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t of length |t| is rewritten in a deterministic manner in time O(|t|+|t'|), where t' denotes the resulting output text. Hence the resulting rewriting mechanism is very efficient. As a second advantage, using standard tools, the transducer can be directly composed with other transducers to efficiently solve more complex rewriting tasks in a single processing step.
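The replacement strategy described in the abstract can be illustrated with a small sketch. The code below is not the subsequential transducer construction from the paper; it is a naive trie-based scan, written under the assumption that "leftmost-longest match with skips" means: at the current text position, replace the longest original that starts there, and if none starts there, copy one character unchanged and move on. The names `TrieNode`, `build_trie` and `rewrite` are hypothetical and chosen only for this illustration. Unlike the transducer, this scan can re-read characters after a failed or shorter match, so it does not guarantee the O(|t|+|t'|) bound.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of a character trie over the originals (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Substitute string if an original ends at this node, else None.
        self.substitute: Optional[str] = None


def build_trie(dictionary: Dict[str, str]) -> TrieNode:
    """Build a trie from a rewriting dictionary {original: substitute}."""
    root = TrieNode()
    for original, substitute in dictionary.items():
        if not original:
            raise ValueError("originals must be non-empty")
        node = root
        for ch in original:
            node = node.children.setdefault(ch, TrieNode())
        node.substitute = substitute
    return root


def rewrite(text: str, root: TrieNode) -> str:
    """Rewrite text with leftmost-longest matching, skipping unmatched characters."""
    out: List[str] = []
    i = 0
    while i < len(text):
        node = root
        best_end = -1
        best_sub: Optional[str] = None
        j = i
        # Walk the trie as far as the text allows, remembering the longest match.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.substitute is not None:
                best_end, best_sub = j, node.substitute
        if best_sub is None:
            # No original starts at position i: copy one character (the "skip").
            out.append(text[i])
            i += 1
        else:
            # Replace the longest original starting at i and continue after it.
            out.append(best_sub)
            i = best_end
    return "".join(out)


# Example dictionary with overlapping originals.
example_trie = build_trie({"ab": "X", "abc": "Y", "bc": "Z"})
print(rewrite("abcd", example_trie))
print(rewrite("xbcab", example_trie))
```

In the first call the original `abc` wins over `ab` because it is the longer match at the leftmost position; in the second, `x` is skipped, then `bc` and `ab` are replaced in turn.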

Type
Papers
Copyright
Copyright © Cambridge University Press 2007

