On The Convergence Of ADAM And Beyond
Introduction
Somewhat differently from the presentation I gave in class, this paper focuses strictly on the convergence pitfalls of the ADAM training algorithm for neural networks from a theoretical standpoint, and it proposes a novel improvement to ADAM called AMSGrad. The key result is that ADAM can get itself "stuck" in its exponentially weighted average of past gradients, which must somehow be prevented: the short memory of this average can quickly erase the influence of rare but informative large gradients, so ADAM's effective step size can grow again at exactly the wrong times.
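To make the "stuck in its history" issue concrete, here is a small numerical sketch (the gradient sequence and the value of the decay rate are illustrative assumptions, not taken from the paper, and bias correction is omitted) contrasting ADAM's exponential moving average of squared gradients with AMSGrad's running maximum, which keeps a long-term memory of the large gradient.

<pre>
# Illustrative sketch (assumed values, not from the paper; bias correction
# omitted): ADAM's exponential moving average v_t largely forgets a rare,
# informative large gradient, while AMSGrad's running maximum retains it.

beta2 = 0.99          # second-moment decay rate (a typical choice)
v_adam = 0.0          # ADAM: exponential moving average of squared gradients
v_hat_amsgrad = 0.0   # AMSGrad: running maximum of v_t

# One large gradient followed by many small ones.
grads = [10.0] + [0.1] * 500

for g in grads:
    v_adam = beta2 * v_adam + (1 - beta2) * g ** 2
    v_hat_amsgrad = max(v_hat_amsgrad, v_adam)

# ADAM's effective step ~ alpha / sqrt(v_adam) grows again as the large
# gradient is forgotten; AMSGrad's ~ alpha / sqrt(v_hat_amsgrad) stays small.
print(f"ADAM    v_t     = {v_adam:.4f}")
print(f"AMSGrad v_hat_t = {v_hat_amsgrad:.4f}")
</pre>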
Notation
The paper presents the following framework, which generalizes gradient-based training algorithms so that a specific variant such as SGD, ADAM, or AMSGrad can be defined entirely within it:
For [math]\displaystyle{ t = 1, \ldots, T }[/math]:

[math]\displaystyle{ g_t = \nabla f_t(x_t) }[/math]

[math]\displaystyle{ m_t = \phi_t(g_1, \ldots, g_t), \quad V_t = \psi_t(g_1, \ldots, g_t) }[/math]

[math]\displaystyle{ \hat{x}_{t+1} = x_t - \alpha_t \, m_t / \sqrt{V_t} }[/math]

[math]\displaystyle{ x_{t+1} = \Pi_{\mathcal{F}, \sqrt{V_t}}\left(\hat{x}_{t+1}\right) }[/math]

Here [math]\displaystyle{ x_t }[/math] is the parameter vector at step [math]\displaystyle{ t }[/math], [math]\displaystyle{ g_t }[/math] is the gradient of the loss [math]\displaystyle{ f_t }[/math], [math]\displaystyle{ \alpha_t }[/math] is the step size, [math]\displaystyle{ \phi_t }[/math] and [math]\displaystyle{ \psi_t }[/math] are "averaging" functions of the past gradients, and [math]\displaystyle{ \Pi_{\mathcal{F}, \sqrt{V_t}} }[/math] denotes projection onto the feasible set [math]\displaystyle{ \mathcal{F} }[/math]. Choosing [math]\displaystyle{ \phi_t(g_1, \ldots, g_t) = g_t }[/math] and [math]\displaystyle{ V_t = \mathbb{I} }[/math] recovers SGD, while exponential moving averages with decay rates [math]\displaystyle{ \beta_1, \beta_2 }[/math] recover ADAM; AMSGrad additionally takes a running maximum of the [math]\displaystyle{ V_t }[/math] estimates.
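As a simplified, hypothetical illustration of how specific optimizers drop out of this framework, the sketch below plugs in particular averaging functions: returning the current gradient with a constant second moment gives SGD, while exponential moving averages plus a running maximum give AMSGrad. The function names and the toy gradient stream are my own choices, and the projection step is omitted (unconstrained case).

<pre>
import numpy as np

# Minimal sketch (assumed, simplified; projection step omitted) of the generic
# update x_{t+1} = x_t - alpha * m_t / sqrt(V_t), with m_t = phi_t(g_1..g_t)
# and V_t = psi_t(g_1..g_t).

def run(grads, x0, alpha, phi, psi):
    """Run the generic adaptive update with averaging functions phi and psi."""
    x, state = np.asarray(x0, dtype=float), {}
    for t, g in enumerate(grads, start=1):
        m = phi(state, g, t)            # first-moment estimate m_t
        V = psi(state, g, t)            # (diagonal) second-moment estimate V_t
        x = x - alpha * m / np.sqrt(V)
    return x

# SGD: phi_t returns the current gradient, psi_t is identically one.
sgd_phi = lambda s, g, t: g
sgd_psi = lambda s, g, t: np.ones_like(g)

# AMSGrad: exponential moving averages plus a running maximum on V_t.
beta1, beta2 = 0.9, 0.99                # decay rates (a typical choice)

def ams_phi(s, g, t):
    s["m"] = beta1 * s.get("m", 0.0) + (1 - beta1) * g
    return s["m"]

def ams_psi(s, g, t):
    s["v"] = beta2 * s.get("v", 0.0) + (1 - beta2) * g ** 2
    s["vhat"] = np.maximum(s.get("vhat", 0.0), s["v"])  # the AMSGrad max step
    return s["vhat"]

grads = [np.array([0.5, -0.2]) for _ in range(10)]       # toy gradient stream
print(run(grads, [1.0, 1.0], alpha=0.1, phi=sgd_phi, psi=sgd_psi))
print(run(grads, [1.0, 1.0], alpha=0.1, phi=ams_phi, psi=ams_psi))
</pre>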