Weighted finite-state transducers for normalization of historical texts

Izaskun Etxeberria; Iñaki Alegria; Larraitz Uria

doi:10.1017/S1351324918000505

Published online by Cambridge University Press: 01 April 2019

Affiliations:
Izaskun Etxeberria*: IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
Iñaki Alegria: IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
Larraitz Uria: IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
*Corresponding author. Email: izaskun.etxeberria@ehu.eus

Abstract

This paper presents a study of methods for the normalization of historical texts. The aim of these methods is to learn the relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we interpret the results in relation to the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared with previously published ones. It learns to map phonological changes using a noisy channel model; it is a simple solution that achieves adequate performance with a limited amount of supervision. The compiled corpora are ready to be used by other researchers in order to compare results. Concerning the amount of supervision required for the task, we investigate how the size of the training corpus affects the results and identify some factors that help anticipate the difficulty of the task.
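
As a rough illustration of the noisy channel idea mentioned in the abstract, the following toy sketch picks the contemporary form m that maximizes P(m) · P(h | m) for a historical form h. It is not the authors' system, which is built on weighted finite-state transducers and learns its mappings from a parallel corpus. Here the channel model is simply assumed to be a fixed penalty per character edit, and the function names, the penalty value, and the word counts are hypothetical placeholders.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(historical, modern_counts, edit_penalty=2.0):
    """Return the modern word maximizing log P(m) + log P(h | m).

    Assumptions (not from the paper): P(m) is a unigram relative frequency,
    and log P(h | m) is approximated as -edit_penalty * edit_distance(h, m).
    """
    total = sum(modern_counts.values())

    def score(modern):
        log_prior = math.log(modern_counts[modern] / total)
        log_channel = -edit_penalty * edit_distance(historical, modern)
        return log_prior + log_channel

    return max(modern_counts, key=score)


# Made-up toy lexicon with frequency counts, for illustration only.
toy_counts = {"old": 50, "ole": 1, "gold": 20, "older": 10}
print(normalize("olde", toy_counts))
```

In this toy setting, three candidates are one edit away from the input, so the frequency prior decides among them. A learned channel model would instead assign different weights to different character mappings.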

Keywords

lexical normalization; historical texts; computational morphology; phonology

Type: Article
Information: Natural Language Engineering, Volume 25, Issue 2, March 2019, pp. 307–321

DOI: https://doi.org/10.1017/S1351324918000505
Copyright: © Cambridge University Press 2019
