Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Introduction

Motivation

Contributions

  • Two attention-based image caption generators built on a common framework: a "soft" deterministic attention mechanism and a "hard" stochastic attention mechanism (see the sketch after this list).
  • Show how to gain insight into, and interpret the results of, this framework by visualizing "where" and "what" the attention focuses on.
  • Quantitatively validate the usefulness of attention in caption generation, achieving state-of-the-art performance on three datasets (Flickr8k, Flickr30k, and MS COCO).
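
The two mechanisms differ only in how the context vector is formed from the attention weights. The following minimal NumPy sketch shows that difference; the array shapes, variable names, and random placeholder values are illustrative assumptions, not the paper's code.

  import numpy as np

  rng = np.random.default_rng(0)

  L, D = 196, 512                    # L spatial locations, D-dim annotations (e.g. a 14x14 conv map)
  a = rng.normal(size=(L, D))        # annotation vectors extracted by a CNN (placeholder values)
  alpha = rng.dirichlet(np.ones(L))  # attention weights over locations; they sum to 1

  # "Soft" deterministic attention: the context vector is the expected
  # annotation vector under the attention distribution.
  z_soft = alpha @ a                 # shape (D,)

  # "Hard" stochastic attention: sample a single location from the
  # distribution and attend to that annotation vector alone.
  s = rng.choice(L, p=alpha)
  z_hard = a[s]                      # shape (D,)

Because the soft context is a differentiable expectation, it can be trained with standard backpropagation, whereas the sampled hard context requires a stochastic gradient estimator.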

Model

The attention framework learns latent alignments from scratch rather than relying on explicit object detectors. This lets the model go beyond "objectness" and learn to attend to abstract concepts.
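
Concretely, at each decoding step a small alignment network scores every annotation vector against the decoder's previous hidden state, and a softmax over those scores yields the attention weights, so the alignment is learned end to end from the captioning objective rather than supplied by a detector. The sketch below illustrates one plausible scoring step; the MLP form, parameter shapes, and names are assumptions for illustration.

  import numpy as np

  rng = np.random.default_rng(1)

  L, D, H = 196, 512, 1024
  a = rng.normal(size=(L, D))        # annotation vectors, one per image location
  h_prev = rng.normal(size=H)        # decoder LSTM hidden state from the previous step

  # Hypothetical one-hidden-layer alignment MLP scoring each location.
  Wa = rng.normal(size=(D, 256)) * 0.01
  Wh = rng.normal(size=(H, 256)) * 0.01
  w = rng.normal(size=256) * 0.01

  e = np.tanh(a @ Wa + h_prev @ Wh) @ w  # unnormalized scores, shape (L,)
  alpha = np.exp(e - e.max())
  alpha /= alpha.sum()                   # attention weights: a distribution over locations

The resulting alpha feeds the soft or hard context computation shown earlier, and visualizing it over the image grid is what reveals "where" the model is looking at each word.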

Results
