Extracting parallel phrases from comparable data for machine translation

  • SANJIKA HEWAVITHARANA (a1) and STEPHAN VOGEL (a2)

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections of a sentence pair, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements of up to 1.2 Bleu points over the baseline. The main contributions of this paper are twofold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
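
For readers unfamiliar with the first baseline, the sketch below illustrates the general idea behind standard phrase extraction from a word alignment: a source span and a target span form a phrase pair only if no alignment link crosses the boundary of the pair. It is a minimal illustration and not the authors' implementation. The function name `extract_phrase_pairs`, the `max_len` limit, and the toy alignment are assumptions introduced here, and the usual extension over unaligned boundary words is omitted.

```python
def extract_phrase_pairs(src_len, alignment, max_len=4):
    """Return inclusive spans (src_start, src_end, tgt_start, tgt_end)
    that are consistent with the word alignment.

    alignment is a set of (source_index, target_index) links.
    Illustrative sketch only: unaligned boundary words are not expanded.
    """
    pairs = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # Target positions linked to any source word in [s1, s2].
            linked = [t for (s, t) in alignment if s1 <= s <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency check: every link that touches the target span
            # [t1, t2] must originate inside the source span [s1, s2].
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                pairs.append((s1, s2, t1, t2))
    return pairs


# Toy alignment (hypothetical): source words 1 and 2 are swapped on the target side.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
toy_pairs = extract_phrase_pairs(3, toy_alignment)
```

In the toy case the source span covering words 0 to 1 is rejected, because its target span also contains a word linked to source word 2. When a sentence pair is only partly parallel, the word alignment over the non-parallel part is unreliable, and this consistency check operates on noisy links; that is the setting the abstract's proposed approach targets.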

Footnotes

Part of this work was conducted when the authors were affiliated to the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
