Generating Descriptive and Accurate Image Captions with Neural Networks

Publication Type: Thesis
Issue Date: 2019
Abstract:
Image captioning is the task of automatically describing an image with a sentence, a topic that connects computer vision and natural language processing. Research on image captioning can greatly help visually impaired people understand their surroundings, and it has potential benefits for sentence-level photo organization. Early work typically tackled this task with retrieval-based or template-based methods, while modern methods are mainly built on a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). However, generating accurate and descriptive captions remains challenging. Accurate captions are sentences consistent with the visual content, while descriptive captions offer diverse descriptions rather than plain, generic sentences. Generally, the vision model is required to encode the visual context comprehensively, and the language model is required to express the visual representation as a readable, consistent sentence. The training strategy also affects performance.

In this thesis, we develop methods and techniques to generate descriptive and accurate image captions with neural networks, addressing three aspects. First, we consider how to express the visual representation consistently in the language model. We propose a Recall Network that selectively imports visual information using GridLSTM units in the RNN. This design effectively prevents the RNN from deviating from the visual representation as it generates each word. Second, we explore a comprehensive visual representation in the vision model based on Graph Convolutional Networks (GCN). A grid-level visual graph is introduced to work collaboratively with a region-level graph, and GCN are applied to aggregate visual neighborhood information in the graphs. Third, we design a new training strategy with Reinforcement Learning (RL) techniques to boost caption diversity. Unlike previous CNN-RNN frameworks, our framework contains an additive noise module that can manipulate the transition hidden states in the RNN. We train the noise module with our proposed noise-critic training algorithm.

We then extend caption generation to creating Modern Chinese poetry from images. We identify three challenges in this task: (a) semantic inconsistency between images and poems, (b) topic drift, and (c) frequent occurrence of certain words. To address these challenges, we develop a Constrained Topic-Aware Model. Specifically, we construct a visual semantic vector from image captions, develop a topic-aware generator based on the Recall Network, and introduce an Anti-Frequency Decoding scheme to constrain high-frequency characters.

Overall, we propose three novel modules for image captioning and an image-to-poetry framework. This thesis strengthens the connection between computer vision and natural language processing and extends caption generation to specific domains.
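As an illustration of the first contribution's core idea, the sketch below shows one way a decoder could selectively re-import ("recall") a projected image feature into its recurrent state at every step before predicting the next word. This is a minimal, hypothetical PyTorch sketch of the general mechanism, not the thesis's GridLSTM-based Recall Network; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class RecallGateDecoder(nn.Module):
    """Hypothetical decoder that gates visual context back into the hidden state each step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim * 2, hidden_dim)   # decides how much visual info to recall
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, tokens):
        # visual_feat: (batch, visual_dim) image feature; tokens: (batch, seq_len) word indices
        batch, seq_len = tokens.shape
        h = visual_feat.new_zeros(batch, self.gate.out_features)
        c = torch.zeros_like(h)
        v = self.visual_proj(visual_feat)                    # project the image feature once
        logits = []
        for t in range(seq_len):
            h, c = self.lstm(self.embed(tokens[:, t]), (h, c))
            g = torch.sigmoid(self.gate(torch.cat([h, v], dim=-1)))
            h = g * v + (1 - g) * h                          # re-inject (recall) visual context
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (batch, seq_len, vocab_size)

# Illustrative usage with made-up sizes.
decoder = RecallGateDecoder(vocab_size=10000)
scores = decoder(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
```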
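For the second contribution, the following sketch shows generic graph-convolutional aggregation over a visual graph, assuming region or grid features as node vectors and a binary adjacency matrix with self-loops. It illustrates the standard mean-aggregation GCN update only, not the thesis's specific grid- and region-level graph construction; the node count and feature sizes are assumptions.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One mean-aggregation graph convolution over node features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # node degrees for normalization
        agg = adj @ node_feats / deg                         # average over each node's neighborhood
        return torch.relu(self.linear(agg))                  # transform the aggregated features

# Hypothetical usage: 36 region nodes with 2048-d detector features.
regions = torch.randn(36, 2048)
adj = (torch.rand(36, 36) > 0.8).float()
adj = (((adj + adj.T) > 0).float() + torch.eye(36)).clamp(max=1.0)  # symmetric + self-loops
layer = SimpleGCNLayer(2048, 512)
region_context = layer(regions, adj)                         # (36, 512) neighborhood-aware features
```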
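For the third contribution, the sketch below shows the general idea of perturbing an RNN's transition hidden state with a learned, state-dependent additive noise term so that decoding can produce more varied wording. The module and its interface are illustrative assumptions; they do not reproduce the thesis's noise module or the noise-critic training algorithm.

```python
import torch
import torch.nn as nn

class HiddenStateNoise(nn.Module):
    """Adds a learned, state-dependent perturbation to an RNN hidden state."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.scale = nn.Linear(hidden_dim, hidden_dim)       # learns how strongly to perturb

    def forward(self, h):
        eps = torch.randn_like(h)                            # random perturbation direction
        return h + torch.tanh(self.scale(h)) * eps           # noisy transition state

# Illustrative usage: the perturbed state would feed the next decoding step.
noise = HiddenStateNoise()
h_noisy = noise(torch.randn(4, 512))
```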
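Finally, for the poetry extension, the snippet below sketches one simple way to down-weight overly frequent characters at decoding time so they dominate the output less. The log-count penalty and the frequency table are illustrative assumptions about the general anti-frequency idea, not the exact Anti-Frequency Decoding scheme defined in the thesis.

```python
import torch

def anti_frequency_decode_step(logits, char_counts, penalty=0.5):
    # logits: (batch, vocab_size) scores for the next character
    # char_counts: (vocab_size,) raw corpus counts of each character
    adjusted = logits - penalty * torch.log1p(char_counts)   # frequent characters lose score
    return adjusted.argmax(dim=-1)                           # greedy choice of next character

# Illustrative usage with a made-up 5000-character vocabulary.
logits = torch.randn(2, 5000)
char_counts = torch.randint(0, 100000, (5000,)).float()
next_chars = anti_frequency_decode_step(logits, char_counts)
```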