Work in computer vision and natural language processing that involves both images and text has grown explosively over the past decade, with a particular boost from the neural network revolution. The present volume brings together five research articles from several corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under three main headings: image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).