Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Introduction
Motivation
Contributions
- Two attention-based image caption generators under a common framework: a "soft" deterministic attention mechanism and a "hard" stochastic attention mechanism.
- Show how to gain insight into, and interpret the results of, this framework by visualizing "where" and "what" the attention focuses on.
- Quantitatively validate the usefulness of attention in caption generation with state-of-the-art performance on three benchmark datasets (Flickr8k, Flickr30k, and MS COCO).
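The two mechanisms differ mainly in how the context vector is formed from the attention weights: soft attention takes the expectation over image locations (differentiable, trainable by standard backpropagation), while hard attention samples a single location (requiring stochastic gradient estimates). A minimal NumPy sketch of that distinction, with hypothetical variable names not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# L annotation vectors of dimension D (e.g., one per conv feature-map location)
L, D = 4, 3
a = rng.standard_normal((L, D))                 # annotation vectors a_i
scores = rng.standard_normal(L)                 # unnormalized attention scores
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

# Soft (deterministic): context is the expectation over locations
z_soft = alpha @ a

# Hard (stochastic): sample one location, use its annotation vector
i = rng.choice(L, p=alpha)
z_hard = a[i]

assert np.isclose(alpha.sum(), 1.0)
assert z_soft.shape == (D,) and z_hard.shape == (D,)
```

Because the hard variant involves a discrete sample, its gradients must be estimated (e.g., with REINFORCE-style methods), whereas the soft variant trains end-to-end directly.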
Previous Work
Model
The attention framework learns latent alignments from scratch rather than relying on explicit object detectors. This lets the model go beyond "objectness" and learn to attend to abstract concepts.
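One common way such alignments are learned end-to-end (a hedged illustration of additive attention in general, not necessarily the paper's exact parameterization) is to score each annotation vector against the decoder's hidden state with a small MLP and normalize with a softmax; the weight matrices below are trained jointly with the rest of the network:

```python
import numpy as np

def attention_weights(a, h, W_a, W_h, v):
    """Additive attention: score_i = v^T tanh(W_a a_i + W_h h)."""
    scores = np.tanh(a @ W_a + h @ W_h) @ v    # one score per location, shape (L,)
    e = np.exp(scores - scores.max())          # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
L, D, H, K = 5, 4, 6, 8     # locations, feature dim, hidden dim, attention dim
a = rng.standard_normal((L, D))                 # annotation vectors
h = rng.standard_normal(H)                      # decoder hidden state
alpha = attention_weights(a, h,
                          rng.standard_normal((D, K)),
                          rng.standard_normal((H, K)),
                          rng.standard_normal(K))

assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()
```

Visualizing `alpha` over the spatial grid at each decoding step is exactly what makes "where" the model attends interpretable.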
Results
References
<references />