Measuring bilingual corpus comparability

BO LI; ERIC GAUSSIER; DAN YANG

doi:10.1017/S1351324917000481

Published online by Cambridge University Press: 15 January 2018

BO LI: Affiliation:
Department of Computer Science, Central China Normal University, Wuhan, China e-mail: libo@mail.ccnu.edu.cn
ERIC GAUSSIER: Affiliation:
CNRS-LIG/AMA, Université Grenoble Alpes, Grenoble, France e-mail: eric.gaussier@imag.fr
DAN YANG: Affiliation:
China Electric Power Research Institute, Wuhan, China e-mail: yangdan3@epri.sgcc.com.cn

Abstract

Comparable corpora serve as an important substitute for parallel resources for under-resourced language pairs. Previous work has mostly aimed to find better strategies for exploiting existing comparable corpora, while ignoring the variation in corpus quality. The quality of a comparable corpus strongly affects its usability in practice, a fact that several studies have confirmed. However, researchers have not yet established a widely accepted and fully validated framework for measuring corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy for testing them. Our experiments show that the proposed comparability measure captures gold-standard comparability levels very well and is robust to the choice of bilingual dictionary. Moreover, we show on the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of that real-world application.

Type: Article
Information: Natural Language Engineering, Volume 24, Issue 4, July 2018, pp. 523–549

DOI: https://doi.org/10.1017/S1351324917000481
Copyright: © Cambridge University Press 2018
