Exploiting unbalanced specialized comparable corpora for bilingual lexicon extraction

  • EMMANUEL MORIN and AMIR HAZEM

Abstract

Most work on bilingual lexicon extraction from comparable corpora rests on the implicit hypothesis that the corpora are balanced in size. However, the historical context-based projection method is relatively insensitive to the size of each part of the comparable corpus. Within this context, we have studied, through a series of experiments, how unbalanced specialized comparable corpora influence the quality of bilingual terminology extraction. Moreover, we have introduced into the context-based projection method a strategy that re-estimates word co-occurrence observations. It relies on smoothing or prediction techniques that boost the observed word co-occurrence counts, which is mainly useful for the smallest part of an unbalanced comparable corpus. Our results show that using unbalanced specialized comparable corpora significantly improves the quality of the extracted lexicons.
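
As a rough illustration of what re-estimating word co-occurrence observations can look like, the sketch below counts co-occurrences within a sliding window and then applies additive (Lidstone-style) smoothing so that unseen pairs receive a small non-zero count. The abstract does not say which smoothing or prediction techniques are used, so the choice of additive smoothing, the function names `cooccurrence_counts` and `lidstone_smooth`, and the `window` and `alpha` parameters are all assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=3):
    """Count ordered (word, context word) pairs within a symmetric window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                counts[(word, tokens[j])] += 1
    return counts


def lidstone_smooth(counts, vocabulary, alpha=0.5):
    """Add a constant alpha to every pair in vocabulary x vocabulary.

    Illustrative assumption: additive smoothing stands in for whatever
    re-estimation technique is applied in practice.
    """
    return {
        (w, c): counts.get((w, c), 0) + alpha
        for w in vocabulary
        for c in vocabulary
    }


if __name__ == "__main__":
    # Toy text standing in for the smaller side of an unbalanced corpus.
    tokens = "breast cancer screening reduces breast cancer mortality".split()
    raw = cooccurrence_counts(tokens, window=1)
    smoothed = lidstone_smooth(raw, sorted(set(tokens)), alpha=0.5)
    print(raw[("breast", "cancer")], smoothed[("breast", "mortality")])
```

With additive smoothing every pair is shifted by the same constant, so observed pairs keep their relative order while unseen pairs are no longer treated as impossible. That matters most when the corpus is small and many plausible co-occurrences are simply missing.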

Footnotes

We thank the two anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript. This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.

