Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices

STELLA FRANK; DESMOND ELLIOTT; LUCIA SPECIA

doi:10.1017/S1351324918000074

Published online by Cambridge University Press: 23 April 2018

STELLA FRANK: Centre for Language Evolution, University of Edinburgh, Edinburgh, UK. E-mail: stella.frank@ed.ac.uk
DESMOND ELLIOTT: Institute of Language, Cognition and Computation, University of Edinburgh, Edinburgh, UK. E-mail: d.elliott@ed.ac.uk
LUCIA SPECIA: Department of Computer Science, University of Sheffield, Sheffield, UK. E-mail: l.specia@sheffield.ac.uk

Abstract

Two studies on multilingual multimodal image description provide empirical evidence on two questions at the core of the task: (i) whether target language speakers prefer descriptions generated directly in their native language, as compared to descriptions translated from a different language; and (ii) whether images improve human translation of descriptions. These results provide guidance for future work in multimodal natural language processing by, first, showing that, on the whole, translations are not distinguished from native language descriptions, and, second, delineating and quantifying the information gained from the image during the human translation task.

Type: Articles
Information: Natural Language Engineering, Volume 24, Special Issue 3: Language for Images, May 2018, pp. 393–413

DOI: https://doi.org/10.1017/S1351324918000074
Copyright: Copyright © Cambridge University Press 2018
