
Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices

Published online by Cambridge University Press: 23 April 2018

STELLA FRANK
Affiliation:
Centre for Language Evolution, University of Edinburgh, Edinburgh, UK e-mail: stella.frank@ed.ac.uk
DESMOND ELLIOTT
Affiliation:
Institute of Language, Cognition and Computation, University of Edinburgh, Edinburgh, UK e-mail: d.elliott@ed.ac.uk
LUCIA SPECIA
Affiliation:
Department of Computer Science, University of Sheffield, Sheffield, UK e-mail: l.specia@sheffield.ac.uk

Abstract

Two studies on multilingual multimodal image description provide empirical evidence on two questions at the core of the task: (i) whether target-language speakers prefer descriptions generated directly in their native language to descriptions translated from a different language; and (ii) whether images improve human translation of descriptions. The results provide guidance for future work in multimodal natural language processing by, first, showing that, on the whole, translations are not distinguished from native-language descriptions and, second, delineating and quantifying the information gained from the image during the human translation task.

Type
Articles
Copyright
Copyright © Cambridge University Press 2018 
