
Where to put the image in an image caption generator

Published online by Cambridge University Press:  23 April 2018

MARC TANTI
Affiliation: Institute of Linguistics and Language Technology, University of Malta, Msida MSD, Malta
e-mail: marc.tanti.06@um.edu.mt

ALBERT GATT
Affiliation: Institute of Linguistics and Language Technology, University of Malta, Msida MSD, Malta
e-mail: albert.gatt@um.edu.mt

KENNETH P. CAMILLERI
Affiliation: Department of Systems and Control Engineering, University of Malta, Msida MSD, Malta
e-mail: kenneth.camilleri@um.edu.mt

Abstract

When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN – conditioning the language model by ‘injecting’ image features – or in a layer following the RNN – conditioning the language model by ‘merging’ image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper, we empirically show that it is not especially detrimental to performance whether one architecture is used or another. The merge architecture does have practical advantages, as conditioning by merging allows the RNN’s hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.
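To make the inject/merge distinction concrete, the following is a minimal PyTorch sketch of the two conditioning strategies. It is purely illustrative: the GRU cell, the vocabulary size, the layer sizes and the exact fusion operations (concatenation) are assumptions made for the example, not the paper's configuration. Only the placement of the image features follows the abstract: inside the RNN for inject, after the RNN for merge, with merge permitting a smaller hidden state.

```python
# Illustrative sketch only (not the authors' exact models).
# Inject: image features enter the RNN at every time step, so the RNN must
# encode both modalities jointly. Merge: the RNN encodes only the word
# sequence, and the image is combined with its output in a later layer.
import torch
import torch.nn as nn


class InjectCaptioner(nn.Module):
    """Image features are fed into the RNN together with the word embeddings."""

    def __init__(self, vocab_size=10000, img_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # The RNN receives word embedding + image vector at each step.
        self.rnn = nn.GRU(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, word_ids):
        words = self.embed(word_ids)                       # (B, T, E)
        img = self.img_proj(img_feats).unsqueeze(1)        # (B, 1, E)
        img = img.expand(-1, words.size(1), -1)            # repeat per time step
        h, _ = self.rnn(torch.cat([words, img], dim=-1))   # joint encoding
        return self.out(h)                                 # next-word logits


class MergeCaptioner(nn.Module):
    """The RNN encodes only the words; the image is merged after the RNN."""

    def __init__(self, vocab_size=10000, img_dim=4096, embed_dim=256, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Multimodal integration is delayed to this layer following the RNN.
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, img_feats, word_ids):
        h, _ = self.rnn(self.embed(word_ids))              # language-only encoding
        img = self.img_proj(img_feats).unsqueeze(1)        # (B, 1, H)
        img = img.expand(-1, h.size(1), -1)
        return self.out(torch.cat([h, img], dim=-1))       # merge, then predict


# Usage: given CNN image features and a caption prefix, both models return
# next-word logits at every position of the prefix.
imgs = torch.randn(2, 4096)
words = torch.randint(0, 10000, (2, 5))
print(InjectCaptioner()(imgs, words).shape)  # torch.Size([2, 5, 10000])
print(MergeCaptioner()(imgs, words).shape)   # torch.Size([2, 5, 10000])
```

Note that in this sketch the merge model's hidden state (128 units) is a quarter of the inject model's (512 units), mirroring the abstract's observation that conditioning by merging allows the RNN's hidden state vector to shrink by up to four times.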

Type: Article
Copyright: © Cambridge University Press 2018

