Efficient dictionary-based text rewriting using subsequential transducers†

S. MIHOV; K. U. SCHULZ

doi:10.1017/S1351324905004092

Published online by Cambridge University Press: 01 December 2007

S. MIHOV: Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria; e-mail: stoyan@lml.bas.bg
K. U. SCHULZ: CIS, University of Munich, Munich, Germany; e-mail: schulz@cis.uni-muenchen.de

Abstract

Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the “original”) and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means replacing occurrences of originals in an input text by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer that accepts any text t as input and outputs the intended rewriting result t' under the so-called “leftmost-longest match” replacement with skips. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t of length |t| is rewritten in a deterministic manner in time O(|t|+|t'|). Hence the resulting rewriting mechanism is very efficient. As a second advantage, using standard tools, the transducer can be directly composed with other transducers to efficiently solve more complex rewriting tasks in a single processing step.
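
To make the replacement strategy concrete, the following is a minimal sketch of “leftmost-longest match” rewriting with skips, written as a plain trie scan. It is not the subsequential transducer construction presented in the paper: this naive scan may re-read characters after a partial match fails, so it does not guarantee the O(|t|+|t'|) bound stated above. The names `build_trie`, `rewrite` and `END` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a naive trie scan, NOT the paper's transducer.
END = object()  # sentinel key marking "an original ends at this node"


def build_trie(dictionary):
    """Build a trie over the originals; each end node stores its substitute."""
    root = {}
    for original, substitute in dictionary.items():
        if not original:
            continue  # assumption: empty originals are ignored
        node = root
        for ch in original:
            node = node.setdefault(ch, {})
        node[END] = substitute
    return root


def rewrite(text, trie):
    """Rewrite text under leftmost-longest match replacement with skips."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        best_end, best_sub = -1, None
        # Follow the trie as far as possible, remembering the longest match.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                best_end, best_sub = j, node[END]
        if best_end != -1:
            out.append(best_sub)  # replace the longest original starting at i
            i = best_end
        else:
            out.append(text[i])   # no original starts here: copy and skip
            i += 1
    return "".join(out)


if __name__ == "__main__":
    trie = build_trie({"ab": "X", "abc": "Y", "bc": "Z"})
    print(rewrite("abcd", trie))
    print(rewrite("xabxbc", trie))
```

In this sketch, a position where no original starts is copied to the output unchanged (the “skip”), and among all originals starting at the same position the longest one wins, so with the dictionary above the text "abcd" is rewritten using the entry for "abc" rather than the entry for "ab".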

Type: Papers
Information: Natural Language Engineering, Volume 13, Issue 4, December 2007, pp. 353–381

DOI: https://doi.org/10.1017/S1351324905004092
Copyright: Copyright © Cambridge University Press 2007
