RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information

Published online by Cambridge University Press:  26 August 2022

Beáta Lőrincz*
Affiliation:
Babeş-Bolyai University, Cluj-Napoca, Romania
Elena Irimia
Affiliation:
Research Institute for Artificial Intelligence ‘Mihai Dragănescu’, Romanian Academy, Bucharest, Romania
Adriana Stan
Affiliation:
Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Verginica Barbu Mititelu
Affiliation:
Research Institute for Artificial Intelligence ‘Mihai Dragănescu’, Romanian Academy, Bucharest, Romania
*Corresponding author. E-mail: beata.lorincz@ubbcluj.ro

Abstract

In this article, we introduce an extended, freely available resource for the Romanian language, named RoLEX. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. RoLEX includes over 330,000 curated entries with information regarding lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. The process of selecting the list of word entries and semi-automatically annotating the complete lexical information associated with each of the entries is thoroughly described.

The dataset’s inherent knowledge is then evaluated in a task of concurrent prediction of syllabification, lexical stress marking and phonemic transcription. The evaluation examined several dataset design factors: the minimum viable number of entries for correct prediction, the reduction of that minimum through expert selection of entries, the augmentation of the input with morphosyntactic information, and the influence of each task on the overall accuracy. The best results, a word error rate of 3.08% and a character error rate of 1.08%, were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. We show that training on a carefully selected subset of entries can yield performance similar to that obtained with a randomly selected set twice as large. In terms of prediction complexity, lexical stress marking posed the most problems, accounting for around 60% of the errors in the predicted sequence.
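For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a word error rate and a character error rate could be computed for isolated lexicon entries. It rests on assumptions that the abstract does not spell out: the word error rate is taken here as the share of entries whose predicted sequence differs from the reference in any position, and the character error rate as the total Levenshtein edit distance divided by the total reference length. The function names and the toy strings in the usage note are hypothetical and are not taken from RoLEX.

```python
from typing import List, Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(refs: List[str], hyps: List[str]) -> float:
    """Assumed definition: fraction of entries predicted with at least one error."""
    if not refs or len(refs) != len(hyps):
        raise ValueError("refs and hyps must be non-empty and of equal length")
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


def char_error_rate(refs: List[str], hyps: List[str]) -> float:
    """Assumed definition: total edit distance over total reference length."""
    if not refs or len(refs) != len(hyps):
        raise ValueError("refs and hyps must be non-empty and of equal length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return edits / total_chars
```

As a usage illustration with placeholder strings, calling `word_error_rate(["a-b'c", "de-f"], ["a-b'c", "d-ef"])` counts one wrong entry out of two, while `char_error_rate` on the same lists divides the two character edits in the second entry by the nine reference characters. Under these definitions a single misplaced stress mark makes the whole entry count as a word error but contributes only one or two character edits, which would be consistent with a word error rate well above the character error rate.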

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Footnotes

Beáta Lőrincz, Elena Irimia, Adriana Stan, and Verginica Barbu Mititelu contributed equally.
