http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Swalsh&feedformat=atomstatwiki - User contributions [US]2022-09-27T14:42:00ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning&diff=36276End-to-End Differentiable Adversarial Imitation Learning2018-04-07T07:21:44Z<p>Swalsh: </p>
<hr />
<div>= Introduction =<br />
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.<br />
<br />
To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminators decision boundary hyper-plane, as seen in Figure 1. This idea was used by (Ho & Ermon, 2016)<sup>[[#References|[2]]]</sup> in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model free setup is the one where the agent cannot make predictions about what the next state and reward will be before it takes each action since the transition function to move from state A to state B is not learned. <br />
<br />
The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992)<sup>[[#References|[8]]]</sup>. This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. Training policy <math>\pi</math> assumes the existence of an expert policy <math>\pi_{E}</math> with given trajectories <math>\{s_{0},a_{0},s_{1},...\}^{N}_{i=0}</math> which it aims to imitate without access to the original reward signal <math>r_{e}</math>. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.<br />
<br />
<br />
[[File:GeneratorFollowingDiscriminator.png|center]]<br />
<br />
Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.<br />
<br />
= Background =<br />
== Markov Decision Process ==<br />
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P :<br />
S × A × S → [0, 1]</math> is the transition probability distribution, <math>r : (S × A) → R</math> is the reward function, <math>\rho_0 : S → [0, 1]</math> is the distribution over initial states, and <math>γ ∈ (0, 1)</math> is the discount factor. Let <math>π</math> denote a stochastic policy <math>π : S × A → [0, 1]</math>, <math>R(π)</math> denote its expected discounted reward: <math>E_πR = E_π [\sum_{t=0}^T \gamma^t r_t]</math> and <math>τ</math> denote a trajectory of states and actions <math>τ = {s_0, a_0, s_1, a_1, ...}</math>.<br />
<br />
== Imitation Learning ==<br />
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm with <math> d_\pi(s) = argmin_{\pi \in \prod}\mathbb{E}_{s \sim d_{\pi}}[l(s,\pi (s))]</math>, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made by most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy. <br />
<br />
This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued till the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following expert's policy at the start of training.<br />
<br />
== Generative Adversarial Networks ==<br />
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:<br />
<br />
\begin{align} <br />
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{x\sim p_E}[log(D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]<br />
\end{align}<br />
<br />
In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.<br />
<br />
GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:<br />
<br />
\begin{align} <br />
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{\pi}[log(D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi))<br />
\end{align}<br />
<br />
where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.<br />
<br />
This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log(D(s,a)] </math>.<br />
<br />
The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]<br />
\end{align}<br />
<br />
where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:<br />
<br />
\begin{align}<br />
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]<br />
\end{align}<br />
<br />
<br />
REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. While recent general variance reduction techniques like RELAX (Grathwohl et al., 2017)<sup>[[#References|[7]]]</sup> work well, they rely on multiple evaluations of the loss function or learning a surrogate neural network. Unfortunately, this is too computationally difficult for our task. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.<br />
<br />
= Algorithm =<br />
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.<br />
<br />
== The discriminator network ==<br />
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in (\pi_E, \pi) </math>. <math>D(s,a)</math> here is the likelihood ratio with the pair <math>{s,a}</math> generated by <math>\pi</math>.<br />
<br />
The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:<br />
<br />
\begin{aligned}<br />
D(s,a) &= p(\pi|s,a) \\<br />
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\<br />
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\<br />
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\<br />
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\<br />
\end{aligned}<br />
<br />
Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:<br />
<br />
\begin{aligned}<br />
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}<br />
\end{aligned}<br />
<br />
to get the final expression for <math> D(s,a) </math>:<br />
\begin{aligned}<br />
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}<br />
\end{aligned}<br />
<br />
<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induces by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action a under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can by obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed to be used for training policies (see following sections):<br />
<br />
\begin{aligned}<br />
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\end{aligned}<br />
<br />
== Backpropagating through stochastic units ==<br />
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.<br />
<br />
=== Continuous Action Distributions ===<br />
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and variance are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, then the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(a,s) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}<br />
\end{align}<br />
<br />
=== Categorical Action Distributions ===<br />
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:<br />
<br />
\begin{align}<br />
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).<br />
\end{align}<br />
<br />
Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick (Gumbel-softmax allows us to generate a differentiable sample from a discrete distribution, which is needed in this trajectory imitation setting.):<br />
<br />
\begin{align}<br />
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{k}exp[\frac{1}{\tau}(g_j + log\ \pi(a_i|s))]}<br />
\end{align}<br />
<br />
<br />
In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.<br />
<br />
The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.<br />
<br />
== Backpropagating through a Forward model ==<br />
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:<br />
<br />
\begin{align}<br />
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\<br />
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )<br />
\end{align}<br />
<br />
<br />
Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagate back in time and influence the actions of policies in earlier times. This is summarized in Figure 3.<br />
<br />
[[File:modelFree_blockDiagram.PNG|400px|center]]<br />
<br />
Figure 2: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.<br />
<br />
[[File:modelBased_blockDiagram.PNG|700px|center]]<br />
<br />
Figure 3: Block diagram of model-based adversarial imitation learning. <br />
<br />
Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action at is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of a.) the error message <math> \delta_a </math> (Green) that propagates fluently through the differentiable approximation of the sampling process. And b.) the error message <math> \delta_s </math> (Blue) of future time-steps, that propagate back through the differentiable forward model.<br />
<br />
== MGAIL Algorithm ==<br />
Shalev-Shwartz et al. (2016)<sup>[[#References|[3]]]</sup> and Heess et al. (2015)<sup>[[#References|[4]]]</sup> built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:<br />
<br />
\begin{align}<br />
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]<br />
\end{align}<br />
<br />
<br />
Using the results from Heess et al. (2015)<sup>[[#References|[4]]]</sup> this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s’) </math> transitions:<br />
<br />
\begin{align}<br />
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\<br />
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]<br />
\end{align}<br />
<br />
The policy gradient <math> \nabla_\theta J </math> is calculated by applying equations 12 and 13 recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.<br />
<br />
[[File:MGAIL_alg.PNG]]<br />
<br />
== Forward Model Structure ==<br />
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled form different distributions need to be first represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent units (GRU, a simpler variant on the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better, and more stable results compared to the standard forward model based on a feed forward neural network. The comparison is presented in Figure 4.<br />
<br />
[[File:performance_comparison.PNG]]<br />
<br />
Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).<br />
<br />
= Experiments =<br />
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot) and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid). These tasks are modelled by the MuJoCo physics simulator (Todorov et al., 2012)<sup>[[#References|[9]]]</sup>, contain second order dynamics and utilize direct torque control. Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015)<sup>[[#References|[5]]]</sup>. Different number of trajectories are used to train the expert for each task, but all trajectories are of length 1000.<br />
The discriminator and generator (policy) networks contains two hidden layers with ReLU non-linearities and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them. A comparison between the basic forward model and the more advanced forward model is also made and described in the previous section of this summary. The two models compared are shown below.<br />
<br />
[[File:baram17_forward.PNG]]<br />
<br />
[[File:mgail_test_results_1.PNG]]<br />
<br />
[[File:mgail_test_results.PNG]]<br />
<br />
Table 1. Policy performance, <math> \pm </math> represents one standard deviation, a higher (reward) value is better. MGAIL consistently outperforms both GAIL and Behavioural cloning approaches, except on the Cartpole, where MGAIL and GAIL perform equally.<br />
<br />
= Discussion =<br />
This paper presented a model-free algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model; this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered as one line of future work. Another future work is to address the violation of the fundamental assumption made by all supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution. The authors tried a solution proposed by another paper (Loshchilov & Hutter, 2016)<sup>[[#References|[10]]]</sup>, which is to reset the learning rate several times during training period, but it did not result in significant improvements.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper: https://github.com/itaicaspi/mgail. The repository provides the source code written by the authors, in Tensorflow.<br />
<br />
= Source =<br />
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.<br />
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.<br />
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).<br />
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.<br />
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.<br />
# Caspi, I. (n.d.). Itaicaspi/mgail. Retrieved March 25, 2018, from https://github.com/itaicaspi/mgail.<br />
# Grathwohl, W., Choi, D., Wu, Y., Roeder, G., & Duvenaud, D. (2017). Backpropagation through the Void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.<br />
# Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.<br />
# Loshchilov, Ilya and Hutter, Frank. Sgdr: Stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36275Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-07T07:11:42Z<p>Swalsh: </p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003)<sup>[[#References|[8]]]</sup> and the dataset from Romani Picas et al.<sup>[[#References|[9]]]</sup> However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in Figure 2), and the attack (B in Figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in Figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in Figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in Figure 2). The authors do add that the WaveNet produces some strikes (L in Figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in Figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.<br />
# Goto, Masataka, Hashiguchi, Hiroki, Nishimura, Takuichi, and Oka, Ryuichi. Rwc music database: Music genre database and musical instrument sound database. 2003.<br />
# Romani Picas, Oriol, Parra Rodriguez, Hector, Dabiri, Dara, Tokuda, Hiroshi, Hariya, Wataru, Oishi, Koji, and Serra, Xavier. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society, 2015</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36274Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-07T07:07:34Z<p>Swalsh: /* Qualitative Comparison */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in Figure 2), and the attack (B in Figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in Figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in Figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in Figure 2). The authors do add that the WaveNet produces some strikes (L in Figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in Figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36273Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-07T07:06:03Z<p>Swalsh: /* Interpolation in Timbre and Dynamics */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in Figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36272Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-07T07:05:01Z<p>Swalsh: /* Conclusion & Future Directions */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36271Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-07T07:03:52Z<p>Swalsh: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=36270stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-04-07T07:00:28Z<p>Swalsh: /* References */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often, we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example, and show the results of 1. ignoring the problem, 2. trying to recover the lost information, and 3. using AmbientGAN as a way to recover the true data distribution. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements. This paper is published in ICLR 2018.<br />
<br />
== Contributions ==<br />
The paper makes the following contributions: <br />
<br />
=== Theoretical Contribution ===<br />
The authors show that the distribution of measured images uniquely determines the distribution of original images. This implies that a pure Nash equilibrium for the GAN game must find a generative model that matches the true distribution. They show similar results for a dropout measurement model, where each pixel is set to zero with some probability p, and a random projection measurement model, where they observe the inner product of the image with a random Gaussian vector. <br />
<br />
=== Empirical Contribution ===<br />
The authors consider CelebA and MNIST dataset for which the measurement model is unknown and show that Ambient GAN recovers a lot of the underlying structure.<br />
<br />
= Related Work = <br />
Currently there exist two distinct approaches for constructing neural network based generative models; they are autoregressive [4,5] and adversarial [6] based methods. The adversarial model has shown to be very successful in modeling complex data distributions such as images, 3D models, state action distributions and many more. This paper is related to the work in [7] where the authors create 3D object shapes from a dataset of 2D projections. This paper states that the work in [7] is a special case of the AmbientGAN framework where the measurement process creates 2D projections using weighted sums of voxel occupancies.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10 datasets We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately, we do not observe any <math>X^g \sim p_x</math> so we cannot use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -<br />
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is, we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with respect to each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.<br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results with the convolve + noise case with the celebA dataset. The AmbientGAN is compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created using a Gaussian kernel and IID Gaussian noise, with <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is. <br />
<br />
For the Keep-Patch measurement model, no pixels outside a box are known and thus inpainting methods are not suitable. For the Pad-Rotate-Project-θ measurements, a conventional technique is to sample many angles, and use techniques for inverting the Radon transform . However, since only a few projections are observed at a time, these methods aren’t readily applicable hence it is unclear how to obtain an approximate inverse function shown below. <br />
<br />
[[File:keep-patch.png]]<br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.<br />
<br />
The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projection, Pad-Rotate-Project model achieved an inception score of 4.18. Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the score of vanilla GAN 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains relatively stable inception score as the block probability was increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
=== Compressed Sensing ===<br />
<br />
As described in Bora et al. (2017), generative models were found to outperform sparsity-based approaches in sensing. Using this knowledge, the generator from AmbientGAN can be tested against Lasso to determine the required measurements to minimize the reconstruction error. As shown on the right of Figure 16, AmbientGAN outperforms Lasso in a fraction of the number of measurements<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorems 5.2===<br />
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.3===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.4===<br />
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. The authors show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. This allows for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known and that this <math>f_theta</math> is also differentiable. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data, or at the very least to remove the differentiability restriction from math>f_theta</math>.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
=Open Source Code=<br />
An implementation of Ambient GAN can be found here: https://github.com/AshishBora/ambient-gan.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.<br />
# Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.<br />
# Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.<br />
# Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural infor- mation processing systems, pp. 2672–2680, 2014.<br />
# Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.<br />
# Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=36269stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-04-07T06:59:48Z<p>Swalsh: /* Empirical Results */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often, we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example, and show the results of 1. ignoring the problem, 2. trying to recover the lost information, and 3. using AmbientGAN as a way to recover the true data distribution. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements. This paper is published in ICLR 2018.<br />
<br />
== Contributions ==<br />
The paper makes the following contributions: <br />
<br />
=== Theoretical Contribution ===<br />
The authors show that the distribution of measured images uniquely determines the distribution of original images. This implies that a pure Nash equilibrium for the GAN game must find a generative model that matches the true distribution. They show similar results for a dropout measurement model, where each pixel is set to zero with some probability p, and a random projection measurement model, where they observe the inner product of the image with a random Gaussian vector. <br />
<br />
=== Empirical Contribution ===<br />
The authors consider CelebA and MNIST dataset for which the measurement model is unknown and show that Ambient GAN recovers a lot of the underlying structure.<br />
<br />
= Related Work = <br />
Currently there exist two distinct approaches for constructing neural network based generative models; they are autoregressive [4,5] and adversarial [6] based methods. The adversarial model has shown to be very successful in modeling complex data distributions such as images, 3D models, state action distributions and many more. This paper is related to the work in [7] where the authors create 3D object shapes from a dataset of 2D projections. This paper states that the work in [7] is a special case of the AmbientGAN framework where the measurement process creates 2D projections using weighted sums of voxel occupancies.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10 datasets We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately, we do not observe any <math>X^g \sim p_x</math> so we cannot use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -<br />
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is, we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with respect to each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.<br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results with the convolve + noise case with the celebA dataset. The AmbientGAN is compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created using a Gaussian kernel and IID Gaussian noise, with <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is. <br />
<br />
For the Keep-Patch measurement model, no pixels outside a box are known and thus inpainting methods are not suitable. For the Pad-Rotate-Project-θ measurements, a conventional technique is to sample many angles, and use techniques for inverting the Radon transform . However, since only a few projections are observed at a time, these methods aren’t readily applicable hence it is unclear how to obtain an approximate inverse function shown below. <br />
<br />
[[File:keep-patch.png]]<br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.<br />
<br />
The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projection, Pad-Rotate-Project model achieved an inception score of 4.18. Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the score of vanilla GAN 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains relatively stable inception score as the block probability was increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
=== Compressed Sensing ===<br />
<br />
As described in Bora et al. (2017), generative models were found to outperform sparsity-based approaches in sensing. Using this knowledge, the generator from AmbientGAN can be tested against Lasso to determine the required measurements to minimize the reconstruction error. As shown on the right of Figure 16, AmbientGAN outperforms Lasso in a fraction of the number of measurements<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorems 5.2===<br />
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.3===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.4===<br />
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. The authors show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. This allows for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known and that this <math>f_theta</math> is also differentiable. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data, or at the very least to remove the differentiability restriction from math>f_theta</math>.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
=Open Source Code=<br />
An implementation of Ambient GAN can be found here: https://github.com/AshishBora/ambient-gan.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.<br />
# Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.<br />
# Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.<br />
# Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural infor- mation processing systems, pp. 2672–2680, 2014.<br />
# Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers&diff=36268stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers2018-04-07T06:41:30Z<p>Swalsh: /* Image Foreground-Background Segmentation Experiment */</p>
<hr />
<div>== Introduction ==<br />
<br />
With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)] which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.<br />
<br />
In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable. <br />
<br />
Ye et al. (2018) explores pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm[based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.<br />
<br />
== Motivation ==<br />
<br />
Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:<br />
<br />
$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$<br />
<br />
where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.<br />
<br />
Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.<br />
<br />
<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center><br />
<br />
Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.<br />
<br />
In contrast with these existing approaches, the authors focus on enforcing sparsity of a tiny set of parameters in CNN — scale parameter <math>\gamma</math> in all batch normalization. Not only placing sparse constraints on <math>\gamma</math> is simpler and easier to monitor, but more importantly, they put forward two reasons:<br />
<br />
1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;<br />
<br />
2. The reparameterization effect across different layers is avoided if its subsequent convolution layer is also batch-normalized. In other words, the impacts from the scale changes of <math>\gamma</math> parameter are independent across different layers.<br />
<br />
Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.<br />
<br />
== Method ==<br />
<br />
At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.<br />
<br />
=== Summary ===<br />
<br />
The basic algorithm can be summarized as follows:<br />
<br />
1. Penalize the L1-norm of the batch normalization scaling parameters in the loss<br />
<br />
2. Train until loss plateaus<br />
<br />
3. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
4. Fine-tune the pruned model using regular learning<br />
<br />
=== Details ===<br />
<br />
There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.<br />
<br />
==== Slow Convergence ====<br />
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$<br />
<br />
Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.<br />
<br />
==== Penalty Normalization ====<br />
<br />
In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)<br />
<br />
To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.<br />
<br />
=== Steps ===<br />
<br />
The final algorithm can be summarized as follows:<br />
<br />
1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math><br />
<br />
2. Compute the global LASSO loss with global scaling constant <math>\rho</math><br />
<br />
3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.<br />
<br />
4. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
5. Fine-tune the pruned model using regular learning<br />
<br />
== Results ==<br />
<br />
The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.<br />
<br />
=== CIFAR-10 Experiment ===<br />
<br />
Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, and then increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly Model C is a continuation of Model B with a penalty of 0.008. <br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-25.png]]<br />
<br />
For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.<br />
<br />
=== ILSVRC2012 Experiment ===<br />
<br />
The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-36.png]]<br />
<br />
=== Image Foreground-Background Segmentation Experiment ===<br />
<br />
The authors note that it is common practice to take a network with pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect there might be some extra channels that while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in table below<br />
<br />
[[File:paper8_Segmentation.png|700px]]<br />
<br />
The neural network used in this experiment is composed of two branches:<br />
* An inception branch that locates the foreground objects<br />
* A DenseNet branch to regress the edges<br />
<br />
It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.<br />
<br />
[[File:pruned_inception.png|600px]]<br />
<br />
== Conclusion ==<br />
<br />
Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.<br />
<br />
It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.<br />
<br />
In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.<br />
<br />
== Implementation == <br />
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning<br />
<br />
<br />
== References ==<br />
<br />
* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).<br />
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.<br />
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.<br />
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.<br />
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.<br />
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).<br />
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf<br />
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.<br />
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers&diff=36267stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers2018-04-07T06:40:50Z<p>Swalsh: /* Image Foreground-Background Segmentation Experiment */</p>
<hr />
<div>== Introduction ==<br />
<br />
With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)] which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.<br />
<br />
In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable. <br />
<br />
Ye et al. (2018) explores pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm[based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.<br />
<br />
== Motivation ==<br />
<br />
Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:<br />
<br />
$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$<br />
<br />
where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.<br />
<br />
Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.<br />
<br />
<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center><br />
<br />
Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.<br />
<br />
In contrast with these existing approaches, the authors focus on enforcing sparsity of a tiny set of parameters in CNN — scale parameter <math>\gamma</math> in all batch normalization. Not only placing sparse constraints on <math>\gamma</math> is simpler and easier to monitor, but more importantly, they put forward two reasons:<br />
<br />
1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;<br />
<br />
2. The reparameterization effect across different layers is avoided if its subsequent convolution layer is also batch-normalized. In other words, the impacts from the scale changes of <math>\gamma</math> parameter are independent across different layers.<br />
<br />
Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.<br />
<br />
== Method ==<br />
<br />
At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.<br />
<br />
=== Summary ===<br />
<br />
The basic algorithm can be summarized as follows:<br />
<br />
1. Penalize the L1-norm of the batch normalization scaling parameters in the loss<br />
<br />
2. Train until loss plateaus<br />
<br />
3. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
4. Fine-tune the pruned model using regular learning<br />
<br />
=== Details ===<br />
<br />
There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.<br />
<br />
==== Slow Convergence ====<br />
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$<br />
<br />
Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.<br />
<br />
==== Penalty Normalization ====<br />
<br />
In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)<br />
<br />
To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.<br />
<br />
=== Steps ===<br />
<br />
The final algorithm can be summarized as follows:<br />
<br />
1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math><br />
<br />
2. Compute the global LASSO loss with global scaling constant <math>\rho</math><br />
<br />
3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.<br />
<br />
4. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
5. Fine-tune the pruned model using regular learning<br />
<br />
== Results ==<br />
<br />
The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.<br />
<br />
=== CIFAR-10 Experiment ===<br />
<br />
Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, and then increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly Model C is a continuation of Model B with a penalty of 0.008. <br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-25.png]]<br />
<br />
For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.<br />
<br />
=== ILSVRC2012 Experiment ===<br />
<br />
The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-36.png]]<br />
<br />
=== Image Foreground-Background Segmentation Experiment ===<br />
<br />
The authors note that it is common practice to take a network with pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect there might be some extra channels that while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in table below<br />
<br />
[[File:paper8_Segmentation.png|700px]]<br />
<br />
The neural network used in this experiment is composed of two branches:<br />
* An inception branch that locates the foreground objects<br />
* A DenseNet branch to regress the edges<br />
<br />
It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.<br />
<br />
[[File:pruned_inception.png|700px]]<br />
<br />
== Conclusion ==<br />
<br />
Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.<br />
<br />
It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.<br />
<br />
In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.<br />
<br />
== Implementation == <br />
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning<br />
<br />
<br />
== References ==<br />
<br />
* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).<br />
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.<br />
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.<br />
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.<br />
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.<br />
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).<br />
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf<br />
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.<br />
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:pruned_inception.png&diff=36266File:pruned inception.png2018-04-07T06:40:14Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36265Word translation without parallel data2018-04-07T06:26:23Z<p>Swalsh: </p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of our method using an example of a low-resource language pair where parallel corpora are not available (English-Esperanto) for which their method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Regularly however these methods need to have an initialization of prior knowledge, such as the editing distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary. <br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Swarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data&diff=36264stat946w18/Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data2018-04-07T05:53:53Z<p>Swalsh: /* Criticism */</p>
<hr />
<div>= Introduction =<br />
During emergency 911 calls, knowing the exact position of the victims is crucial to a fast response and a successful rescue. Knowing the victim's floor level in an emergency can speed up the search by a factor proportional to the number of floors in the building. Problems arise when the caller is unable to give their physical position accurately. This can happen for instance when the caller is disoriented, held hostage, or a child is calling on behalf of the victim. GPS sensors on smartphones can provide the rescuers with the geographic location. However GPS fails to give an accurate floor level inside a tall building. Previous work have explored using Wi-Fi signals or beacons placed inside the buildings, but these methods are not self-contained and require prior infrastructure knowledge.<br />
<br />
Fortunately, today’s smartphones are equipped with many more sensors including barometers and magnetometers. Deep learning can be applied to predict floor level based on these sensor readings. <br />
Firstly, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer sensor readings. Next, an unsupervised clustering algorithm is used to predict the floor level depending on the barometric pressure difference. With these two parts working together, a self-contained floor level prediction system can achieve 100% accuracy, without any external prior knowledge.<br />
<br />
This paper is published in ICLR 2018. The code, data, and app are open-source on [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data (GitHub)]<br />
<br />
= Data Description =<br />
The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. The following sensor readings were recorded: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The indoor-outdoor data has to be manually entered as soon as the user enters or exits a building. To gather the data for floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. The actual floor level was recorded manually for validation purposes only, since unsupervised learning is being used.<br />
<br />
=== Note: Barometric formula ===<br />
<br />
The barometric measures, sometimes called the exponential atomsphere or isothermal atmosphere, is the measure used to model how the pressure (or density) of the air changes with altitude. The pressure drops approximately by 11.3 Pa per meter in first 1000 meters above sea level.<br />
<br />
= Methods =<br />
The proposed method first determines if the user is indoor or outdoor and detects the instances of transition between them. When an outdoor to indoor transition event occurs, the elevation of the user is saved using an estimation from the cellphone barometer. Finally, the exact floor level is predicted through clustering techniques. Indoor/outdoor classification is critical to the working of this method. Once the user is detected to be outdoors, he is assumed to be at the ground level. The vertical height and floor estimation is applied only when the user is indoors. The indoor/outdoor transitions are used to save the barometer readings at the ground level for use as reference pressure.<br />
<br />
=== Indoor/Outdoor Classification === <br />
<br />
An LSTM network is used to solve the indoor-outdoor classification problem. Here is a diagram of the network architecture.<br />
<br />
[[File:lstm.jpg | 500px]]<br />
<br />
Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for d consecutive time-steps. Target is y = 1 if indoors and y = 0 if outdoors.<br />
<br />
<math> X_i</math> contains a set of <math>d</math> consecutive sensor readings, i.e. <math> X_i = [x_1, x_2,...,x_d] </math>. <math>Y</math> is labelled as 0 for outdoors and 1 for indoors. <math>d</math> is chosen to be 3 by random-search so that <math>X</math> has 3 points <math>X_i = [x_{j-1}, x_j, x_{j+1}]</math> and the middle <math>x_j</math> is used for the <math>y</math> label.<br />
The LSTM contains three layers. Layers one and two have 50 neurons followed by a dropout layer set to 0.2. Layer 3 has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings, and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true labels and the predictions.<br />
<br />
\begin{equation}<br />
C(y_i, \hat{y}_i) = \frac{1}{n} \sum_{i=1}^{n} -(y_i log(\hat{y_i}) + (1 - y_i) log(1 - \hat{y_i}))<br />
\label{equation:binCE}<br />
\end{equation}<br />
<br />
The main reason why the neural network is able to predict whether the user is indoors or outdoors is that it learns a pattern of how the walls of buildings interfere with the GPS signals. The LSTM is able to find the pattern in the GPS signal strength in combination with other sensor readings to give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoor. Thus, a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground floor.<br />
<br />
=== Indoor/Outdoor Transition === <br />
To determine the exact time the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.<br />
<br />
\begin{equation}<br />
V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]<br />
\end{equation} <br />
<br />
\begin{equation}<br />
V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]<br />
\end{equation}<br />
<br />
The Jaccard distances measures the similarity of two sets and is calculated with the following equation:<br />
<br />
\begin{equation}<br />
J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
If the Jaccard distance between <math>V_{1}</math> and sub-sequence <math> s_i </math> is greater or equal to the threshold 0.4, it means there was a transition from indoors to outdoors in the vicinity of the 20 second range of the vector mask. Similarly, a distance of to 0.4 or greater to <math>V_{2}</math> indicates a transition from outdoors to indoors. Sets of transition windows are merged together if they occur close in time to each other, with the average transition time of both windows being used as the new transition time.<br />
<br />
[[File:FindIOIndexes.png | 700px]]<br />
<br />
=== Vertical Height Estimation === <br />
Once the barometric pressure of the ground floor is known, the user’s current relative altitude can be calculated by the international pressure equation, where <math>m_\Delta</math> is the estimated height, <math> p_1 </math> is the pressure reading of the device, and <math> p_0 </math> is the reference pressure at ground level while transitioning from outdoor to indoor.<br />
<br />
\begin{equation}<br />
m_\Delta = f_{floor}(p_0, p_1) = 44330 (1 - (\frac{p_1}{p_0})^{\frac{1}{5.255}})<br />
\label{equation:baroHeight}<br />
\end{equation}<br />
<br />
In appendix B.1, the authors acknowledge that for this system to work, pressures variations due to weather or temperature must be accounted for as those variations are on the same order of magnitude or larger than the pressure variations caused by changing altitude. They suggest using a nearby reference station with known altitude to continuously measure and correct for this effect.<br />
<br />
=== Floor Estimation === <br />
Given the user’s relative altitude, the floor level can be determined. However, this is not a straightforward task because different buildings have different floor heights, different floor labeling (E.g. not including the 13th floor), and floor heights within the same building can vary from floor to floor. To solve these problems, altitude data collected are clustered into groups. Each cluster represents the approximate altitude of a floor.<br />
<br />
Here is an example of altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represent the center of a cluster.<br />
<br />
[[File:clusters.png | 500px]]<br />
<br />
Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is specially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building learned from clustered data-points.<br />
<br />
Here is the algorithm for the floor level prediction.<br />
<br />
[[File:PredictFloor.png | 700px]]<br />
<br />
= Experiments and Results =<br />
The authors performed evaluation on two different tasks: The indoor-outdoor classification task and the floor level prediction task. In the indoor-outdoor detection task, they compared six different models, LSTM, feedforward neural networks, logistic regression, SVM, HMM and Random Forests. In the floor level prediction task, they evaluated the full system.<br />
<br />
== Indoor-Outdoor Classification Results ==<br />
Here are the results for the indoor-outdoor classification problem using different machine learning techniques. LSTM has the best performance on the test set.<br />
The LSTM is trained for 24 epochs with a batch size of 128. All the hyper-parameters such as learning rate(0.006), number of layers, d size, number of hidden units and dropout rate were searched through random search algorithm.<br />
<br />
[[File:IOResults.png]]<br />
<br />
== Floor Level Prediction Results ==<br />
The following are the results for the floor level prediction from the 63 collected samples. Results are given as the percent which matched the floor exactly, off by one, or off by more than one. In each column, the left number is the accuracy using a fixed floor height, and the number on the right is the accuracy when clustering was used to calculate a variable floor height. It was found that using the clustering technique produced 100% accuracy on floor predictions. The conclusion from these results is that using building-specific floor heights produces significantly better results.<br />
<br />
[[File:FloorLevelResults.png]]<br />
<br />
== Floor Level Clustering Results == <br />
Here is the comparison between the estimated floor height and the ground truth in the Uris Hall building.<br />
<br />
[[File:FloorComparison.png]]<br />
<br />
= Criticism =<br />
This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical discoveries. The machine learning techniques used are fairly standard. The neural networks used in this paper only contains 3 layers, and the clustering is applied on one-dimensional data. This leads to the question whether deep learning is necessary and suitable for this task.<br />
<br />
It was explained in the paper that there are many cases where the system does not work. Some cases that were mentioned include: buildings with glass walls, delayed GPS signals, <br />
and pressure changes caused by air conditioning. Other examples I can think of are: uneven floors with some area higher than others, floors rarely visited, and tunnels from one building to another. These special cases are not specifically mentioned in the paper, but they do note that differences between outdoors and pressure-sealed buildings is a problem<br />
<br />
Another weakness of the method comes from the clustering technique. It requires a fair bit of training data. The author suggested two approaches. First, the data can be stored in the individual smartphone. This is not realistic as most people do not visit every single floor of every building, even if it is their own apartment buildings. The second approach is to let a central system (emergency department) collect data from multiple users (which is what the paper’s results are based on). However, such data collection would need to be done in accordance with local laws. Perhaps a better solution would be to use elevation reading to estimate a floor based on typical floor height. Even having a small range of floors of interest could help first responders significantly narrow down their response time.<br />
<br />
Aside from all the technical issues, if knowing the exact floor is required, would it maybe be easier to let the rescuers carry a barometer with them and search for the floor with the transmitted pressure reading?<br />
<br />
== Real-world Considerations == <br />
<br />
In the appendices the real-world issues discovered are discussed and possible solutions are proposed.<br />
<br />
'''Pressure Variance'''<br />
<br />
Changing weather conditions and geographical locations can greatly affect the barometric pressure from different cases. As a possible solution, gathering the current pressure conditions from a nearby landmark such as an airport can be used to normalize the local pressure. Alternatively, the knowledge of local wi-fi access points can establish if a user is changing locations or if the pressure is naturally changing.<br />
<br />
Another potential issue when using pressure readings is that different phones were found to read the local pressure at varying offsets from each-other. This shows that some form of calibration of the phone would have to be provided prior to the use of the app.<br />
<br />
'''Battery Impact''' <br />
<br />
In having an app regularly collecting data from the GPS and motion sensors, the battery life of the device will be severely impacted. While the motion sensing has already been addressed in iOS systems by running on a dedicated chip, the GPS would need to be sampled far less frequently.<br />
<br />
= Conclusion =<br />
This paper presented a novel deep learning application in predicting the floor level given sensory data from mobile phones. While there are no new theoretical discoveries, the application is novel and important for 911-responders; indeed, previous studies have shown that survival rates for urgent medical events drop exponentially for each floor increase. Although much of this is attributed to the actual floor height, this situation makes it all the more important to reduce ground-to-floor travel time.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors&diff=36263Robust Imitation of Diverse Behaviors2018-04-07T05:17:40Z<p>Swalsh: added experiment details</p>
<hr />
<div>=Introduction=<br />
One of the longest standing challenges in AI is building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, the authors combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.<br />
<br />
=Motivation=<br />
Deep generative models have recently shown great promise in imitation learning. The authors primarily talk about two approaches viz. supervised approaches that condition on demonstrations and Generative Adversarial Imitation Learning (GAIL). The authors also talk about the strengths and limitations of the two approaches and try to combine the two approaches in order to get the best of both worlds.<br />
<br />
* Supervised approaches that condition on demonstrations using a variational autoencoder (VAE):<br />
** They require large training datasets in order to work for non-trivial tasks<br />
** Experiments show that the VAE learns a structured semantic embedding space, allowing for smooth policy interpolation<br />
** They tend to be brittle and fail when the agent diverges too much from the demonstration trajectories (As proof of this brittleness, the authors cite Ross et al. (2010), who provide a theorem showing that the cost incurred by this kind of model when it deviates from a demonstration trajectory with a small probability can be amplified in a manner quadratic in the number of time steps.)<br />
** VAEs provides a latent vector representation unlike GANS or autoregressive models which produce sharp and at times realistic image samples, but tend to be slow to sample from, that is why VAEs are used to learn demonstration trajectories.<br />
<br />
* Generative Adversarial Imitation Learning (GAIL)<br />
** Allows learning more robust policies with fewer demonstrations<br />
** Adversarial training leads to mode-collapse (the tendency of adversarial generative models to cover only a subset of modes of a probability distribution, resulting in a failure to produce adequately diverse samples)<br />
** More difficult and slow to train as they do not immediately provide a latent representation of the data<br />
<br />
Thus, the former approach can model diverse behaviors without dropping modes but does not learn robust policies, while the latter approach gives robust policies but insufficiently diverse behaviors. Thus, the authors combine the favorable aspects of these two approaches. The base of their model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. Leveraging these policy representations, they develop a new version of GAIL that <br />
# is much more robust than the purely-supervised controller, especially with few demonstrations, and <br />
# avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.<br />
<br />
=Model=<br />
The authors first introduce a variational autoencoder (VAE) for supervised imitation, consisting of a bi-directional LSTM encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state while modeling correlations among states with a WaveNet. A conditional GAN is used for the GAIL step. GAIL is what actually generates the policy, but it is conditioned/initialized based on the VAE latent state.<br />
<br />
[[File: Model_Architecture.png|700px|center|]]<br />
<br />
==Behavioral cloning with VAE suited for control==<br />
<br />
In this section, the authors follow a similar approach to Duan et al. (2017), but opt for stochastic VAEs as having a distribution <math display="inline">q_\phi(z|x_{1:T})</math> to better regularize the latent space. In their VAE, an encoder stochastically maps a demonstration sequence to an embedding vector <math display="inline">z</math>. Given <math display="inline">z</math>, they decode both the state and action trajectories as shown in the figure above. To train the model, the following loss is minimized:<br />
<br />
\begin{align}<br />
L\left( \alpha, w, \phi; \tau_i \right) = - \pmb{\mathbb{E}}_{q_{\phi}(z|x_{1:T_i}^i)} \left[ \sum_{t=1}^{T_i} log \pi_\alpha \left( a_t^i|x_t^i, z \right) + log p_w \left( x_{t+1}^i|x_t^i, z\right) \right] +D_{KL}\left( q_\phi(z|x_{1:T_i}^i)||p(z) \right)<br />
\end{align}<br />
<br />
Where <math> \alpha </math> parameterizes the action decoder, <math> w </math> parameterizes the state decoder, <math> \phi </math> parameterizes the state encoder, and <math> T_i \in \tau_i </math> is the set of demonstration trajectories.<br />
<br />
The encoder <math display="inline">q</math> uses a bi-directional LSTM. To produce the final embedding, it calculates the average of all the outputs of the second layer of this LSTM before applying a final linear transformation to<br />
generate the mean and standard deviation of a Gaussian. Then, one sample from this Gaussian is taken as the demonstration encoding.<br />
<br />
The action decoder is an MLP that maps the concatenation of the state and the embedding of the parameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model. In particular, it conditions on the embedding <math display="inline">z</math> and previous state <math display="inline">x_{t-1}</math> to generate the vector <math display="inline">x_t</math> autoregressively. That is, the autoregression is over the components of the vector <math display="inline">x_t</math>. Finally, instead of a Softmax, the model uses a mixture of Gaussians as the output of the WaveNet.<br />
<br />
==Diverse generative adversarial imitiation learning==<br />
To enable GAIL to produce diverse solutions, the authors condition the discriminator on the embeddings generated by the VAE encoder and integrate out the GAIL objective with respect to the variational posterior <math display="inline">q_\phi(z|x_{1:T})</math>. Specifically, the authors train the discriminator by optimizing the following objective:<br />
<br />
\begin{align}<br />
{max}_{\psi} \pmb{\mathbb{E}}_{\tau_i \sim \pi_E} \left( \pmb{\mathbb{E}}_{q(z|x_{1:T_i}^i)} \left[\frac{1}{T_i} \sum_{t=1}^{T_i} logD_{\psi} \left( x_t^i, a_t^i | z \right) + \pmb{\mathbb{E}}_{\pi_\theta} \left[ log(1 - D_\psi(x, a | z)) \right] \right] \right)<br />
\end{align}<br />
<br />
There is related work which uses a conditional GAIL objective to learn controls for multiple behaviors from state trajectories, but the discriminator conditions on an annotated class label, as in conditional GANs.<br />
<br />
The authors condition on unlabeled trajectories, which have been passed through a powerful encoder, and hence this approach is capable of one-shot imitation learning. Moreover, the VAE encoder enables to obtain a continuous latent embedding space where interpolation is possible.<br />
<br />
Since the discriminator is conditional, the reward function is also conditional and clipped so that it is upper-bounded. Conditioning on <math>z</math> allows for the generation of an infinite number of reward functions, each tailored to imitate a different trajectory. Due to the diversity of the reward functions, the policy gradients will not collapse into one particular mode through mode skewing.<br />
<br />
To better motivate the objective, the authors propose on temporarily leaving the context of imitation learning and considering an alternative objective for training GANs<br />
<br />
\begin{align}<br />
{min}_{G}{max}_{D} V (G, D) = \int_{y} p(y) \int_{z} q(z|y) \left[ log D(y | z) + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dy dz<br />
\end{align}<br />
<br />
This function is a simplification of the previous objective function. Furthermore, it satisfies the following property.<br />
<br />
===Lemma 1===<br />
Assuming that <math display="inline">q</math> computes the true posterior distribution that is <math display="inline">q(z|y) = \frac{p(y|z)p(z)}{p(y)}</math> then<br />
<br />
\begin{align}<br />
V (G, D) = \int_{z} p(z) \left[ \int_{y} p(y|z) log D(y|z) dy + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dz<br />
\end{align}<br />
<br />
If an optimal discriminator is further assumed, the cost optimized by the generator then becomes<br />
<br />
\begin{align}<br />
C(G) = 2 \int_ p p(z) JSD[p(\cdot|z) || G(\cdot|z)] dz - log4<br />
\end{align}<br />
<br />
where <math display="inline">JSD</math> stands for the Jensen-Shannon divergence. In the context of the WaveNet described in the earlier section, <math>p(x)</math> is the distribution of a mixture of Gaussians, and <math>p(z)</math> is the distribution over the mixture components, so the conditional distribution over the latent <math>z</math>, <math>p(x | z)</math> is uni-modal, and optimizing the divergence will not lead to mode collapse.<br />
<br />
==Policy Optimization Strategy: TRPO==<br />
<br />
[[file:robust_behaviour_alg.png | 800px]]<br />
<br />
In Algorithm 1, it states that TRPO is used for policy parameter updates. TRPO is short for Trust Region Policy Optimization, which an iterative procedure for policy optimization, developed by John Schulman, Sergey Levine, Philip Moritz, Micheal Jordan and Pieter Abbeel. This optimization methods achieves monotonic improve in fields related to robotic motions, such as walking and swimming. For more details on TRPO, please refer to the [https://arxiv.org/pdf/1502.05477.pdf original paper].<br />
<br />
=Experiments=<br />
<br />
The primary focus of the paper's experimental evaluation is to demonstrate that the architecture allows learning of robust controllers capable of producing the full spectrum of demonstration behaviors for a diverse range of challenging control problems. The authors consider three bodies: a 9 DoF robotic arm, a 9 DoF planar walker, and a 62 DoF complex humanoid (56-actuated joint angles, and a freely translating and rotating 3d root joint). While for the reaching task BC is sufficient to obtain a working controller, for the other two problems the full learning procedure is critical.<br />
<br />
The authors analyze the resulting embedding spaces and demonstrate that they exhibit a rich and sensible structure that can be exploited for control. Finally, the authors show that the encoder can be used to capture the gist of novel demonstration trajectories which can then be reproduced by the controller.<br />
<br />
==Robotic arm reaching==<br />
In this experiment, the authors demonstrate the effectiveness of their VAE architecture and investigate the nature of the learned embedding space on a reaching task with a simulated Jaco arm.<br />
<br />
To obtain demonstrations, the authors trained 60 independent policies to reach to random target locations in the workspace starting from the same initial configuration. 30 trajectories from each of<br />
the first 50 policies were generated. These served as training data for the VAE model (1500 training trajectories in total). The remaining 10 policies were used to generate test data.<br />
<br />
Here are the trajectories produced by the VAE model.<br />
<br />
[[File: Robotic_arm_reaching_VAE.png|300px|center|]]<br />
<br />
The reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust. After training, the VAE encodes and reproduces the demonstrations as shown in the figure below.<br />
<br />
[[File: Robotic_arm_reaching.png|650px|center|]]<br />
<br />
==2D Walker==<br />
<br />
As a more challenging test compared to the reaching task, the authors consider bipedal locomotion. Here, the authors train 60 neural network policies for a 2d walker to serve as demonstrations. These policies are each trained to move at different speeds both forward and backward depending on a label provided as additional input to the policy. Target speeds for training were chosen from a set of four different speeds (m/s): -1, 0, 1, 3.<br />
<br />
[[File: 2D_Walker.png|650px|center|]]<br />
<br />
The authors trained their model with 20 episodes per policy (1200 demonstration trajectories in total, each with a length of 400 steps or 10s of simulated time). In this experiment a full approach is required: training the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general, presumably because the relatively small training set does not cover the space of trajectories sufficiently densely. On this generated dataset, they also train policies with GAIL using the same architecture and hyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherently trajectories. Instead, it simply meshes different behaviors together. In addition, the policies trained with GAIL also, exhibit dramatically less diversity; see [https://www.youtube.com/watch?v=kIguLQ4OwuM video].<br />
<br />
[[File: 2D_Walker_Optimized.gif|frame|center|In the left panel, the planar walker demonstrates a particular walking style. In the right panel, the model's agent imitates this walking style using a single policy network.]]<br />
<br />
==Complex humanoid==<br />
For this experiment, the authors consider a humanoid body of high dimensionality that poses a hard control problem. They generate training trajectories with the existing controllers, which can produce instances of one of six different movement styles. Examples of such trajectories are shown in Fig. 5.<br />
<br />
[[File: Complex_humanoid.png|650px|center|]]<br />
<br />
The training set consists of 250 random trajectories from 6 different neural network controllers that were trained to match 6 different movement styles from the CMU motion capture database. Each trajectory is 334 steps or 10s long. The authors use a second set of 5 controllers from which they generate trajectories for evaluation (3 of these policies were trained on the same movement styles as the policies used for generating training data).<br />
<br />
Surprisingly, despite the complexity of the body, supervised learning is quite effective at producing sensible controllers: The VAE policy is reasonably good at imitating the demonstration trajectories, although it lacks the robustness to be practically useful. Adversarial training dramatically improves the stability of the controller. The authors analyze the improvement quantitatively by computing the percentage of the humanoid falling down before the end of an episode while imitating either training or test policies. The results are summarized in Figure 5 right. The figure further shows sequences of frames of representative demonstration and associated imitation trajectories. Videos of demonstration and imitation behaviors can be found in the [https://www.youtube.com/watch?v=NaohsyUxpxw video].<br />
<br />
[[File: Complex_humanoid_optimized.gif|frame|center|In the left and middle panels we show two demonstrated behaviors. In the right panel, the model's agent produces an unseen transition between those behaviors.]]<br />
<br />
Also, for practical purposes, it is desirable to allow the controller to transition from one behavior to another. The authors test this possibility in an experiment similar to the one for the Jaco arm: They determine the<br />
embedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on the first embedding vector, and then transition from one behavior to the other half-way through the episode by linearly interpolating the embeddings of the two demonstration trajectories over a window of 20 control steps. Although not always successful the learned controller often transitions robustly, despite not having been trained to do so. An example of this can be seen in the gif above. More examples can be seen in this [https://www.youtube.com/watch?v=VBrIll0B24o video].<br />
<br />
==Experiment Details==<br />
<br />
The network specifications for all 3 experiments can be found in Table 1 below, while the fine-tuned specifications for the Walker and Humanoid experiments can be found in Table 3 and Table 4.<br />
<br />
[[File: VAE_network.png|650px|center|]]<br />
<br />
<br />
[[File: VAE_phaseSpec.png|550px|center|]]<br />
<br />
<br />
[[File: VAE_hyperParamSpec.png|600px|center|]]<br />
<br />
=Conclusions=<br />
The authors have proposed an approach for imitation learning that combines the favorable properties of techniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a model that from a moderate number of demonstration tragectories, can learn:<br />
# a semantically well structured embedding of behaviors, <br />
# a corresponding multi-task controller that allows to robustly execute diverse behaviors from this embedding space, as well as <br />
# an encoder that can map new trajectories into the embedding space and hence allows for one-shot imitation. <br />
The experimental results demonstrate that this approach can work on a variety of control problems and that it scales even to very challenging ones such as the control of a simulated humanoid with a large number of degrees of freedoms.<br />
<br />
=Critique=<br />
The paper proposes a deep-learning-based approach to imitation learning which is sample-efficient and is able to imitate many diverse behaviors. The architecture can be seen as conditional generative adversarial imitation learning (GAIL). The conditioning vector is an embedding of a demonstrated trajectory, provided by a variational autoencoder. This results in one-shot imitation learning: at test time, a new demonstration can be embedded and provided as a conditioning vector to the imitation policy. The authors evaluate the method on several simulated motor control tasks.<br />
<br />
Pros:<br />
* Addresses a challenging problem of learning complex dynamics controllers / control policies<br />
* Well-written introduction / motivation<br />
* The proposed approach is able to learn complex and diverse behaviors and outperforms both the VAE alone (quantitatively) and GAIL alone (qualitatively).<br />
* Appealing qualitative results on the three evaluation problems. Interesting experiments with motion transitioning. <br />
<br />
Cons:<br />
* Comparisons to baselines could be more detailed.<br />
* Many key details are omitted (either on purpose, placed in the appendix, or simply absent, like the lack of definitions of terms in the modeling section, details of the planner model, simulation process, or the details of experimental settings)<br />
* Experimental evaluation is largely subjective (videos of robotic arm/biped/3D human motion). Even then, a rigorous study measuring subjective performance has not been performed.<br />
* A discussion of sample efficiency compared to GAIL and VAE would be interesting.<br />
* The presentation is not always clear, in particular, I had a hard time figuring out the notation in Section 3.<br />
* There has been some work on hybrids of VAEs and GANs, which seem worth mentioning when generative models are discussed, like:<br />
# Autoencoding beyond pixels using a learned similarity metric, Larsen et al., ICML 2016<br />
# Generating Images with Perceptual Similarity Metrics based on Deep Networks, Dosovitskiy&Brox. NIPS 2016<br />
These works share the intuition that good coverage of VAEs can be combined with sharp results generated by GANs.<br />
* Some more extensive analysis of the approach would be interesting. How sensitive is it to hyperparameters? How important is it to use VAE, not usual AE or supervised learning? How difficult will it be for others to apply it to new tasks?<br />
<br />
=References=<br />
# Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., & Zaremba, W. (2017). One-shot imitation learning. Preprint arXiv:1703.07326.<br />
# Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.<br />
# Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).<br />
# Producing flexible behaviours in simulated environments. (n.d.). Retrieved March 25, 2018, from https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/<br />
# Cmu humanoid. (2017, May 19). Retrieved March 25, 2018, from https://www.youtube.com/watch?v=NaohsyUxpxw<br />
# Cmu transitions. (2017, May 19). Retrieved March 25, 2018, from https://www.youtube.com/watch?v=VBrIll0B24o<br />
# Conditional GAN: https://arxiv.org/pdf/1411.1784.pdf</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:VAE_hyperParamSpec.png&diff=36262File:VAE hyperParamSpec.png2018-04-07T05:14:16Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:VAE_phaseSpec.png&diff=36261File:VAE phaseSpec.png2018-04-07T05:14:07Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:VAE_network.png&diff=36260File:VAE network.png2018-04-07T05:13:57Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings&diff=36259A Neural Representation of Sketch Drawings2018-04-07T04:44:52Z<p>Swalsh: added further details on the training used</p>
<hr />
<div>= Introduction =<br />
<br />
There have been many recent advances in neural generative models for low resolution pixel-based images. Humans, however, do not see the world in a grid of pixels and more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method for creating conditional and unconditional generative models for creating these kinds of vector sketch drawings based on recurrent neural networks (RNNs). For the conditional generation mode, the authors explore the model's latent space that it uses to express the vector image. The paper also explores many applications of these kinds of models, especially creative applications and makes available their unique dataset of vector images.<br />
<br />
= Related Work =<br />
<br />
Previous work related to sketch drawing generation includes methods that focused primarily on converting input photographs into equivalent vector line drawings. Image generating models using neural networks also exist but focused more on generation of pixel-based imagery. For example, Gatys et al.'s (2015) work focuses on separating style and content from pixel-based artwork and imagery. Some recent work has focused on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has been extended somewhat recently to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modeling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Auto-encoders to model English sentences in latent vector space.<br />
<br />
One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.<br />
<br />
= Methodology =<br />
<br />
=== Dataset ===<br />
<br />
The “QuickDraw” dataset used in this research was assembled from 75K user drawings extracted from the game “Quick, Draw!” where users drew objects from one of hundreds of classes in 20 seconds or less. The dataset is split into 70K training samples and 2.5K validation and test samples each and represents each sketch a set of “pen stroke actions”. Each action is provided as a vector in the form <math>(\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>. For each vector, <math>\Delta x</math> and <math>\Delta y</math> give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of pen states; <math>p_{1}</math> indicates that the pen is down and a line should be drawn between the current point and the next point, <math>p_{2}</math> indicates that the pen is up and no line should be drawn between the current point and the next point, and <math>p_{3}</math> indicates that the drawing is finished and subsequent points and the current point should not be drawn.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchrnn.PNG]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder model is a symmetric and parallel set of two RNNs that individually process the sketch drawings (sequence <math>S</math>) in forward and reverse order, respectively. The hidden state produced by each encoder model is then concatenated into a single hidden state <math>h</math>. <br />
<br />
\begin{align}<br />
h_\rightarrow = \text{encode}_\rightarrow(S), h_\leftarrow = \text{encode}_\leftarrow(S_{\text{reverse}}), h=[h_\rightarrow; h_\leftarrow]<br />
\end{align}<br />
<br />
The concatenated hidden state <math>h</math> is then projected into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> each of size <math>N_{z}</math> using a fully connected layer. <math>\hat{\sigma}</math> is then converted into a non-negative standard deviation parameter <math>\sigma</math> using an exponential operator. These two parameters <math>\mu</math> and <math>\sigma</math> are then used along with an IID Gaussian vector distributed as <math>\mathcal{N}(0, I)</math> of size <math>N_{z}</math> to construct a random vector <math>z \in ℝ^{N_{z}}</math>, similar to the method used for VAE:<br />
\begin{align}<br />
\mu = W_{\mu}h + b_{mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = exp\bigg{(}\frac{\hat{\sigma}}{2}\bigg{)}\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I)<br />
\end{align}<br />
<br />
The decoder model is an autoregressive RNN that samples output sketches from the latent vector <math>z</math>. The initial hidden states of each recurrent neuron are determined using <math>[h_{0}, c_{0}] = tanh(W_{z}z + b_{z})</math>. Each step of the decoder RNN accepts the previous point <math>S_{i-1}</math> and the latent vector <math>z</math> as concatenated input. The initial point given is the origin point with pen state down. The output at each step are the parameters for a probability distribution of the next point <math>S_{i}</math>. Outputs <math>\Delta x</math> and <math>\Delta y</math> are modeled using a Gaussian Mixture Model (GMM) with M normal distributions and output pen states <math>(q_{1}, q_{2}, q_{3})</math> modelled as a categorical distribution with one-hot encoding.<br />
\begin{align}<br />
P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1<br />
\end{align}<br />
<br />
For each of the M distributions in the GMM, parameters <math>\mu</math> and <math>\sigma</math> are output for both the x and y locations signifying the mean location of the next point and the standard deviation, respectively. Also output from each model is parameter <math>\rho_{xy}</math> signifying correlation of each bivariate normal distribution. An additional vector <math>\Pi</math> is an output giving the mixture weights for the GMM. The output <math>S_{i}</math> is determined from each of the mixture models using softmax sampling from these distributions.<br />
<br />
One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete will only appear one time per each sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter <math>N_{max}</math> which basically is the length of the longest sketch in the dataset and limits the number of points per drawing to being no more than <math>N_{max}</math>, after which all output states form the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.<br />
<br />
To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step and the model is sampled until a “stop drawing” state appears or the time state reaches time <math>N_{max}</math>. The authors also introduce a “temperature” parameter <math>\tau</math> that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:<br />
<br />
\begin{align}<br />
\hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau<br />
\end{align}<br />
<br />
This parameter <math>\tau</math> lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.<br />
<br />
=== Unconditional Generation ===<br />
<br />
[[File:paper15_Unconditional_Generation.png|800px|]]<br />
<br />
The authors also explored unconditional generation of sketch drawings by only training the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing input are used as input without any conditional latent variable <math>z</math>. Figure 3 above shows different sketches that are sampled from the network by only varying the temperature parameter <math>\tau</math> between 0.2 and 0.9.<br />
<br />
=== Training ===<br />
The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss <math>L_{R}</math> and KL Divergence Loss <math>L_{KL}</math>. The reconstruction loss term is composed of two terms; <math>L_{s}</math>, which tries to maximize the log-likelihood of the generated probability distribution explaining the training data <math>S</math> and <math>L_{p}</math> which is the log loss of the pen state terms.<br />
\begin{align}<br />
L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}log\bigg{(}\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\bigg{)}<br />
\end{align}<br />
\begin{align}<br />
L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}log(q_{k,i})<br />
\end{align}<br />
\begin{align}<br />
L_{R} = L_{s} + L{p}<br />
\end{align}<br />
<br />
The KL divergence loss <math>L_{KL}</math> measures the difference between the latent vector <math>z</math> and an IID Gaussian distribution with 0 mean and unit variance. This term, normalized by the number of dimensions <math>N_{z}</math> is calculated as:<br />
\begin{align}<br />
L_{KL} = -\frac{1}{2N_{z}}\big{(}1 + \hat{\sigma} - \mu^{2} – exp(\hat{\sigma})\big{)}<br />
\end{align}<br />
<br />
The loss for the entire model is thus the weighted sum:<br />
\begin{align}<br />
Loss = L_{R} + w_{KL}L_{KL}<br />
\end{align}<br />
<br />
The value of the weight parameter <math>w_{KL}</math> has the effect that as <math>w_{KL} \rightarrow 0</math>, there is a loss in ability to enforce a prior over the latent space and the model assumes the form of a pure autoencoder. As with VAEs, there is a trade-off between optimizing for the two loss terms (i.e. between how precisely the model can regenerate training data <math>S</math> and how closely the latent vector <math>z</math> follows a standard normal distribution) - smaller values of <math>w_{KL}</math> lead to better <math>L_R</math> and worse <math>L_{KL}</math> compared to bigger values of <math>w_{KL}</math>. Also for unconditional generation, the model is a standalone decoder, so there will be no <math>L_{KL}</math> term as only <math>L_{R}</math> is optimized for. This trade-off is illustrated in Figure 4 showing different settings of <math>w_{KL}</math> and the resulting <math>L_{KL}</math> and <math>L_{R}</math>, as well as just <math>L_{R}</math> in the case of unconditional generation with only a standalone decoder.<br />
<br />
[[File:paper15_fig4.png|600px]]<br />
<br />
In practice however it was found that annealing the KL term improves the training of the network. While the original loss can be used for testing and validation, when training the following variation on the loss is used:<br />
\begin{align}<br />
\eta_{step} = 1 - (1 - \eta_{min})R^{step}<br />
\end{align}<br />
\begin{align}<br />
Loss_{train} = L_{R} + w_{KL} \eta_{step} max(L_{KL},KL_{min})<br />
\end{align}<br />
<br />
As can be seen above, the <math>\eta_{step}</math> term will start at some preset <math>\eta_{min}</math> value. As <math>R</math> is a value slightly smaller than 1, the <math>R^{step}</math> term will converge to 0, and thus <math>\eta_{step}</math> will converge to 1. This will have the affect of focusing the training loss in the early stages on the reconstruction loss <math>L_{R}</math>, but to a more balanced loss in the later steps. Additionally, in practice it was found that when <math>L_{KL}</math> got too low, the network would cease learning. To combat this a floor value for the KL loss was implemented by the <math>max(.)</math> function.<br />
<br />
=== Model Configuration ===<br />
In the given model, the encoder and decoder RNNs consist of 512 and 2048 nodes respectively. Also, M = 20 mixture components are used for the decoder RNN. Layer Normalization is applied to the model, and during training recurrent dropout is applied with a keep probability of 90%. The model is trained with batch sizes of 100 samples, using Adam with a learning rate of 0.0001 and gradient clipping of 1.0. During training, simple data augmentation is performed by multiplying the offset columns by two IID random factors. <br />
<br />
= Experiments =<br />
The authors trained multiple conditional and unconditional models using varying values of <math>w_{KL}</math> and recorded the different <math>L_{R}</math> and <math>L_{KL}</math> values at convergence. The network used LSTM as it’s encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks. (A HyperLSTM consists of two coupled LSTMS: an auxiliary LSTM and a main LSTM. At every time step, the auxiliary LSTM reads the previous hidden state and the current input vector, and computes an intermediate vector <math display="inline"> z </math>. The weights of the main LSTM used in the current time step are then a learned function of this intermediate vector <math display="inline"> z </math>. That is, the weights of the main LSTM are allowed to vary between time steps as a function of the output of the auxiliary LSTM. See Ha et al. (2016) for details)<br />
<br />
=== Conditional Reconstruction ===<br />
[[File:conditional_generation.PNG]]<br />
<br />
The authors qualitatively assessed the reconstructed images <math>S’</math> given input sketch <math>S</math> using different values for the temperature hyperparameter <math>\tau</math>. The figure above shows the results for different values of <math>\tau</math> starting with 0.01 at the far left and increasing to 1.0 on the far right. Interestingly, sketches with extra features like a cat with 3 eyes are reproduced as a sketch of a cat with two eyes and sketches of object of a different class such as a toothbrush are reproduced as a sketch of a cat that maintains several of the input toothbrush sketches features.<br />
<br />
=== Latent Space Interpolation ===<br />
[[File:latent_space_interp.PNG]]<br />
<br />
The latent space vectors <math>z</math> have few “gaps” between encoded latent space vectors due to the enforcement of a Gaussian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images in the same style as latent space arithmetic on Word2Vec vectors. A model trained with higher <math>w_{KL}</math> is expected to produce images closer to the data manifold, and the figure above shows reconstructed images from latent vector interpolation between the original images. Results from the model trained with higher <math>w_{KL}</math> seem to produce more coherent images.<br />
<br />
=== Sketch Drawing Analogies ===<br />
Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head, and then adding a latent vector to the embedding layer that represents “body”. As an example, this “body” vector might be produced by taking a drawing of a pig with a body and subtracting a vector representing the pigs head.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches ===<br />
[[File:predicting_endings.PNG]]<br />
<br />
Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode some existing points into the hidden state of the decoder network and then generating the remaining points of the sketch with <math>\tau</math> set to 0.8.<br />
<br />
= Applications and Future Work =<br />
Sketch-RNN may enable the production of several creative applications. These might include suggesting ways an artist could finish a sketch, enabling artists to explore latent space arithmetic to find interesting outputs given different sketch inputs, or allowing the production of multiple different sketches of some object as a purely generative application. The authors suggest that providing some conditional sketch of an object to a model designed to produce output from a different class might be useful for producing sketches that morph the two different object classes into one sketch. For example, the image below was trained on drawing cats, but a chair was used as the input. This results in a chair looking cat.<br />
<br />
[[File:cat-chair.png]]<br />
<br />
Sketch-RNN may also be useful as a teaching tool to help people learn how to draw, especially if it were to be trained on higher quality images. Teaching tools might suggest to students how to proceed to finish a sketch or intake low fidelity sketches to produce a higher quality and “more coherent” output sketch.<br />
<br />
The authors noted that Sketch-RNN is not as effective at generating coherent sketches when trained on a large number of classes simultaneously (experiments mostly used datasets consisting of one or two object classes), and plan to use class information outside the latent space to try to model a greater number of classes.<br />
<br />
Finally, the authors suggest that combining this model with another that produces photorealistic pixel-based images using sketch input, such as Pix2Pix may be an interesting direction for future research. In this case, the output from the Sketch-RNN model would be used as input for Pix2Pix and could produce photorealistic images given some crude sketch from a user.<br />
<br />
= Limitations =<br />
The authors note a major limitation to the model is the training time relative to the number of data points. When sketches surpass 300 data points the model is difficult to train. To counteract this effect the Ramer-Douglas-Peucker algorithm was used to reduce the number of data points per sketch. This algorithm attempts to significantly reduce the number of data points while keeping the sketch as close to the original as possible.<br />
<br />
Another limitation is the effectiveness of generating sketches as the complexity of the class increases. Below are sketches of a few classes which show how the less complex classes such as cats and crabs are more accurately generated. Frogs (more complex) tend to have overly smooth lines drawn which do not seem to be part of realistic frog samples.<br />
<br />
[[File:paper15_classcomplexity.png]]<br />
<br />
A further limitation is the need to train individual neural networks for each class of drawings. While the former is useful in sketch completion with labeled incomplete sketches it may produce low quality results when the starting sketch is very different than any part of the learned representation. Further work can be done to extend the model to account for both prediction of the label and sketch completion.<br />
<br />
= Conclusion =<br />
The authors presented Sketch-RNN, a RNN model for modelling and generating vector-based sketch drawings with a goal to abstract concepts in the images similar to the way humans think. The VAE inspired architecture allows sampling the latent space to generate new drawings and also allows for applications that use latent space arithmetic in the style of Word2Vec to produce new drawings given operations on embedded sketch vectors. The authors also made available a large dataset of sketch drawings in the hope of encouraging more research in the area of vector-based image modelling.<br />
<br />
= Criticisms =<br />
The paper produces an interesting model that can effectively model vector-based images instead of traditional pixel-based images. This is an interesting problem because vector based images require producing a new way to encode the data. While the results from this paper are interesting, most of the techniques used are borrowed ideas from Variational Autoencoders and the main architecture is not terribly groundbreaking. <br />
<br />
One novel part about the architecture presented was the way the authors used GMMs in the decoder network. While this was interesting and seemed to allow the authors to produce different outputs given the same latent vector input <math>z</math> by manipulating the <math>\tau</math> hyperparameter, it was not that clear in the article why GMMs were used instead of a more simple architecture. Much time was spent explaining basics about GMM parameters like <math>\mu</math> and <math>\sigma</math>, but there was comparatively little explanation about how points were actually sampled from these mixture models.<br />
<br />
The authors contribute to a novel dataset but fail to evaluate the quality of the dataset, including generalized metrics for evaluation. They also provide no comparisons of their method on this dataset with other baseline sequence generation approaches.<br />
<br />
Finally, the authors gloss somewhat over how they were able to encode previous sketch points using only the decoder network into the hidden state of the decoder RNN to finish partially finished sketches. I can only assume that some kind of back-propagation was used to encode the expected sketch points into the hidden parameters of the decoder, but no explanation was given in the paper.<br />
<br />
== Major Contributions ==<br />
<br />
The paper provides with intuition of their approach and detailed below are the major contributions of this paper:<br />
* For images composed by sequence of lines, such as hand drawing, this paper proposed a framework to generate such image in vector format, conditionally and unconditionally. <br />
* Provided a unique training procedure that targets vector images, which makes training procedures more robust.<br />
* Composed large dataset of hand drawn vector images which benefits future development.<br />
* Discussed several potential applications of this methodology, such as drawing assist for artists and educational tool for students.<br />
<br />
= Implementation =<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn<br />
<br />
= Source =<br />
<br />
# Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).<br />
# Tensorflow/magenta. (n.d.). Retrieved March 25, 2018, from https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn<br />
# Graves et al, 2013, https://arxiv.org/pdf/1308.0850.pdf</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation&diff=36238stat946w18/Synthetic and natural noise both break neural machine translation2018-04-07T00:32:48Z<p>Swalsh: /* Richness of Natural Noise */</p>
<hr />
<div>== Introduction ==<br />
Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.<br />
<br />
* "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."<br />
<br />
A person's ability to read this text comes as no surprise to the Psychology literature<br />
# Saberi & Perrott (1999) found that this robustness extends to audio as well.<br />
# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.<br />
# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.<br />
# Mayall et al (1997) shows that we rely on word shape.<br />
# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension<br />
<br />
However, neural machine translation (NMT) systems are brittle. i.e. The Arabic word<br />
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter. <br />
<br />
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.<br />
<br />
The figure below shows the performance translating German to English as a function of the percent of German words modified. Here two types of noise are shown: (1) In blue, random permutation of the word and (2) In green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.<br />
<br />
[[File:BLEU_plot.PNG|center]] <br />
<br />
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1. BELU computes the scores for individual translated segments and then computes an average accuracy score for the whole corpus.<br />
<br />
This paper explores two simple strategies for increasing model robustness:<br />
# using structure-invariant representations (character CNN representation)<br />
# robust training on noisy data, a form of adversarial training.<br />
<br />
The goal of the paper is two-fold:<br />
# to initiate a conversation on robust training and modeling techniques in NMT<br />
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks<br />
<br />
== Adversarial examples ==<br />
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic<br />
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with<br />
access to the model parameters, and black-box attacks, where examples are generated without such access.<br />
<br />
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.<br />
<br />
== MT system ==<br />
The authors experiment with three different NMT systems with access to character information at different levels.<br />
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.<br />
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.<br />
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.<br />
<br />
== Data ==<br />
=== MT Data ===<br />
The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.<br />
<br />
[[File:Table1x.PNG|center]]<br />
<br />
=== Natural and Artificial Noise ===<br />
==== Natural Noise ====<br />
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:<br />
<br />
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)<br />
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus<br />
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils<br />
<br />
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.<br />
<br />
They insert these errors into the source-side of the parallel data by replacing every word in the corpus with an error if one exists in our dataset. When there is more than one possible replacement to choose, words for which there is no error, are sampled uniformly and kept as is.<br />
<br />
==== Synthetic Noise ====<br />
In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo. <br />
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).<br />
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).<br />
# <code>Fully Random</code> Completely randomized words.<br />
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key<br />
<br />
[[File:Table3x.PNG|center]]<br />
<br />
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy<br />
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true<br />
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the<br />
translation quality, with random scrambling producing the lowest BLEU scores.<br />
<br />
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well as mentioned in the introduction. The table below shows the translations performed by a German native-speaker human, not familiar with the meme and three machine translation methods. Clearly, the machine translation methods failed. <br />
<br />
[[File:paper16_tab4.png|center]]<br />
<br />
The author also examined improvements by using a simple spell checker. The author tried correcting error through Google's spell checker by simply accepting the first suggestion on the detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The author concluded using existing spell checkers would not improve the accuracy to be comparable with vanilla text. The results are shown in the table below.<br />
<br />
<br />
[[File:paper16_tab5.png|center]]<br />
<br />
== Dealing with noise ==<br />
=== Structure Invariant Representations ===<br />
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.<br />
<br />
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.<br />
<br />
[[File:Table5x.PNG|center]]<br />
<br />
<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard errors and Natural errors.<br />
<br />
=== Black-Box Adversarial Training ===<br />
<br />
<code>charCNN</code> Performance<br />
[[File:Table6x.PNG|center]]<br />
<br />
Here is the result of the translation of the scrambled meme:<br />
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”<br />
<br />
== Analysis ==<br />
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===<br />
<br />
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances. <br />
<br />
As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close uniform weights, hence the consistently lower variance measures.<br />
<br />
[[File:Table7x.PNG|center]]<br />
<br />
=== Richness of Natural Noise ===<br />
[[File:SNNoise_NatNoiseExp.png|750px|right]]<br />
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the modes trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with error for training. The synthetic noise analysed is lacking a two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter and omitting letters.<br />
<br />
During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:<br />
* 34% Phonetic error<br />
* 32% Character omissions<br />
* 34% Other: Morphological, Key swap, ect.<br />
<br />
Examples of these types of errors can be seen in Table 8.<br />
<br />
== Conclusion ==<br />
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to<br />
address multiple types of errors that are seen in training.<br />
For the future work, the author suggested generating more realistic synthetic noise by using phonetic and syntactic structure. Also, they suggested that a better NMT architecture could be designed which can be robust to noise without seeing it in the training data.<br />
<br />
== Criticism ==<br />
According to the [https://openreview.net/forum?id=BJ8vJebC- OpenReview thread], a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation which has been mostly left untouched. Also these solutions/models still do not tackle the problem of natural noise as the models trained on the synthetic noise don't generalize well to natural noise.<br />
<br />
== References ==<br />
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2017.<br />
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.<br />
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.<br />
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.<br />
# Aurlien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedias Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, may 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.<br />
# Katrin Wisniewski, Karin Schne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.<br />
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.<br />
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation&diff=36237stat946w18/Synthetic and natural noise both break neural machine translation2018-04-07T00:32:12Z<p>Swalsh: /* Richness of Natural Noise */</p>
<hr />
<div>== Introduction ==<br />
Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.<br />
<br />
* "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."<br />
<br />
A person's ability to read this text comes as no surprise to the Psychology literature<br />
# Saberi & Perrott (1999) found that this robustness extends to audio as well.<br />
# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.<br />
# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.<br />
# Mayall et al (1997) shows that we rely on word shape.<br />
# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension<br />
<br />
However, neural machine translation (NMT) systems are brittle. i.e. The Arabic word<br />
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter. <br />
<br />
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.<br />
<br />
The figure below shows the performance translating German to English as a function of the percent of German words modified. Here two types of noise are shown: (1) In blue, random permutation of the word and (2) In green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.<br />
<br />
[[File:BLEU_plot.PNG|center]] <br />
<br />
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1. BELU computes the scores for individual translated segments and then computes an average accuracy score for the whole corpus.<br />
<br />
This paper explores two simple strategies for increasing model robustness:<br />
# using structure-invariant representations (character CNN representation)<br />
# robust training on noisy data, a form of adversarial training.<br />
<br />
The goal of the paper is two-fold:<br />
# to initiate a conversation on robust training and modeling techniques in NMT<br />
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks<br />
<br />
== Adversarial examples ==<br />
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic<br />
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with<br />
access to the model parameters, and black-box attacks, where examples are generated without such access.<br />
<br />
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.<br />
<br />
== MT system ==<br />
The authors experiment with three different NMT systems with access to character information at different levels.<br />
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.<br />
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.<br />
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.<br />
<br />
== Data ==<br />
=== MT Data ===<br />
The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.<br />
<br />
[[File:Table1x.PNG|center]]<br />
<br />
=== Natural and Artificial Noise ===<br />
==== Natural Noise ====<br />
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:<br />
<br />
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)<br />
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus<br />
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils<br />
<br />
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.<br />
<br />
They insert these errors into the source-side of the parallel data by replacing every word in the corpus with an error if one exists in our dataset. When there is more than one possible replacement to choose, words for which there is no error, are sampled uniformly and kept as is.<br />
<br />
==== Synthetic Noise ====<br />
In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo. <br />
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).<br />
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).<br />
# <code>Fully Random</code> Completely randomized words.<br />
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key<br />
<br />
[[File:Table3x.PNG|center]]<br />
<br />
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy<br />
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true<br />
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the<br />
translation quality, with random scrambling producing the lowest BLEU scores.<br />
<br />
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well as mentioned in the introduction. The table below shows the translations performed by a German native-speaker human, not familiar with the meme and three machine translation methods. Clearly, the machine translation methods failed. <br />
<br />
[[File:paper16_tab4.png|center]]<br />
<br />
The author also examined improvements by using a simple spell checker. The author tried correcting error through Google's spell checker by simply accepting the first suggestion on the detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The author concluded using existing spell checkers would not improve the accuracy to be comparable with vanilla text. The results are shown in the table below.<br />
<br />
<br />
[[File:paper16_tab5.png|center]]<br />
<br />
== Dealing with noise ==<br />
=== Structure Invariant Representations ===<br />
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.<br />
<br />
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.<br />
<br />
[[File:Table5x.PNG|center]]<br />
<br />
<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard errors and Natural errors.<br />
<br />
=== Black-Box Adversarial Training ===<br />
<br />
<code>charCNN</code> Performance<br />
[[File:Table6x.PNG|center]]<br />
<br />
Here is the result of the translation of the scrambled meme:<br />
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”<br />
<br />
== Analysis ==<br />
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===<br />
<br />
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances. <br />
<br />
As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close uniform weights, hence the consistently lower variance measures.<br />
<br />
[[File:Table7x.PNG|center]]<br />
<br />
=== Richness of Natural Noise ===<br />
<br />
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the modes trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with error for training. The synthetic noise analysed is lacking a two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter and omitting letters.<br />
<br />
During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:<br />
* 34% Phonetic error<br />
* 32% Character omissions<br />
* 34% Other: Morphological, Key swap, ect.<br />
<br />
Examples of these types of errors can be seen in Table 8 below.<br />
<br />
[[File:SNNoise_NatNoiseExp.png|750px|center]]<br />
<br />
== Conclusion ==<br />
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to<br />
address multiple types of errors that are seen in training.<br />
For the future work, the author suggested generating more realistic synthetic noise by using phonetic and syntactic structure. Also, they suggested that a better NMT architecture could be designed which can be robust to noise without seeing it in the training data.<br />
<br />
== Criticism ==<br />
According to the [https://openreview.net/forum?id=BJ8vJebC- OpenReview thread], a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation which has been mostly left untouched. Also these solutions/models still do not tackle the problem of natural noise as the models trained on the synthetic noise don't generalize well to natural noise.<br />
<br />
== References ==<br />
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2017.<br />
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.<br />
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.<br />
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.<br />
# Aurlien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedias Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, may 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.<br />
# Katrin Wisniewski, Karin Schne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.<br />
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.<br />
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation&diff=36236stat946w18/Synthetic and natural noise both break neural machine translation2018-04-07T00:31:53Z<p>Swalsh: Add Natural noise review and examples</p>
<hr />
<div>== Introduction ==<br />
Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.<br />
<br />
* "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."<br />
<br />
A person's ability to read this text comes as no surprise to the Psychology literature<br />
# Saberi & Perrott (1999) found that this robustness extends to audio as well.<br />
# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.<br />
# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.<br />
# Mayall et al (1997) shows that we rely on word shape.<br />
# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension<br />
<br />
However, neural machine translation (NMT) systems are brittle. i.e. The Arabic word<br />
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter. <br />
<br />
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.<br />
<br />
The figure below shows the performance translating German to English as a function of the percent of German words modified. Here two types of noise are shown: (1) In blue, random permutation of the word and (2) In green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.<br />
<br />
[[File:BLEU_plot.PNG|center]] <br />
<br />
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1. BELU computes the scores for individual translated segments and then computes an average accuracy score for the whole corpus.<br />
<br />
This paper explores two simple strategies for increasing model robustness:<br />
# using structure-invariant representations (character CNN representation)<br />
# robust training on noisy data, a form of adversarial training.<br />
<br />
The goal of the paper is two-fold:<br />
# to initiate a conversation on robust training and modeling techniques in NMT<br />
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks<br />
<br />
== Adversarial examples ==<br />
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic<br />
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with<br />
access to the model parameters, and black-box attacks, where examples are generated without such access.<br />
<br />
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.<br />
<br />
== MT system ==<br />
The authors experiment with three different NMT systems with access to character information at different levels.<br />
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.<br />
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.<br />
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.<br />
<br />
== Data ==<br />
=== MT Data ===<br />
The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.<br />
<br />
[[File:Table1x.PNG|center]]<br />
<br />
=== Natural and Artificial Noise ===<br />
==== Natural Noise ====<br />
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:<br />
<br />
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)<br />
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus<br />
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils<br />
<br />
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.<br />
<br />
They insert these errors into the source-side of the parallel data by replacing every word in the corpus with an error if one exists in our dataset. When there is more than one possible replacement to choose, words for which there is no error, are sampled uniformly and kept as is.<br />
<br />
==== Synthetic Noise ====<br />
In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo. <br />
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).<br />
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).<br />
# <code>Fully Random</code> Completely randomized words.<br />
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key<br />
<br />
[[File:Table3x.PNG|center]]<br />
<br />
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy<br />
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true<br />
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the<br />
translation quality, with random scrambling producing the lowest BLEU scores.<br />
<br />
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well as mentioned in the introduction. The table below shows the translations performed by a German native-speaker human, not familiar with the meme and three machine translation methods. Clearly, the machine translation methods failed. <br />
<br />
[[File:paper16_tab4.png|center]]<br />
<br />
The author also examined improvements by using a simple spell checker. The author tried correcting error through Google's spell checker by simply accepting the first suggestion on the detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The author concluded using existing spell checkers would not improve the accuracy to be comparable with vanilla text. The results are shown in the table below.<br />
<br />
<br />
[[File:paper16_tab5.png|center]]<br />
<br />
== Dealing with noise ==<br />
=== Structure Invariant Representations ===<br />
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.<br />
<br />
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.<br />
<br />
[[File:Table5x.PNG|center]]<br />
<br />
<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard errors and Natural errors.<br />
<br />
=== Black-Box Adversarial Training ===<br />
<br />
<code>charCNN</code> Performance<br />
[[File:Table6x.PNG|center]]<br />
<br />
Here is the result of the translation of the scrambled meme:<br />
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”<br />
<br />
== Analysis ==<br />
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===<br />
<br />
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances. <br />
<br />
As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close uniform weights, hence the consistently lower variance measures.<br />
<br />
[[File:Table7x.PNG|center]]<br />
<br />
=== Richness of Natural Noise ===<br />
<br />
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the modes trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with error for training. The synthetic noise analysed is lacking a two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter and omitting letters.<br />
<br />
During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:<br />
* 34% Phonetic error<br />
* 32% Character omissions<br />
* 34% Other: Morphological, Key swap, ect.<br />
<br />
Examples of these types of errors can be seen in Table 8 below.<br />
<br />
[[File:SNNoise_NatNoiseExp.png|750px|left]]<br />
<br />
== Conclusion ==<br />
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to<br />
address multiple types of errors that are seen in training.<br />
For the future work, the author suggested generating more realistic synthetic noise by using phonetic and syntactic structure. Also, they suggested that a better NMT architecture could be designed which can be robust to noise without seeing it in the training data.<br />
<br />
== Criticism ==<br />
According to the [https://openreview.net/forum?id=BJ8vJebC- OpenReview thread], a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation which has been mostly left untouched. Also these solutions/models still do not tackle the problem of natural noise as the models trained on the synthetic noise don't generalize well to natural noise.<br />
<br />
== References ==<br />
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2017.<br />
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.<br />
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.<br />
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.<br />
# Aurlien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedias Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, may 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.<br />
# Katrin Wisniewski, Karin Schne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.<br />
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.<br />
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:SNNoise_NatNoiseExp.png&diff=36235File:SNNoise NatNoiseExp.png2018-04-07T00:30:18Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=36233Wasserstein Auto-Encoders2018-04-07T00:17:24Z<p>Swalsh: /* Experimental Results */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. There has been recent research in generating encoder-decoder GANs where an encoder is trained in parallel with the generator, based on the intuition that this will allow the GAN to learn meaningful mapping from the compressed representation to the original image; however, these models also suffer from mode-collapse and perform comparable to vanilla GANs. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, encoder-decoder structure and nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
When dealing with the continuous probability domain, the EM distance or the minimum one among the costs of all dirt moving solutions becomes:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\sim\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
Where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because it's marginal structure gives some intuition that it represents the amount of probability mass to be moved from x to y. This intuition can be explained by looking at the following equation.<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
<br />
The Wasserstein distance or the cost of Optimal Transport (OT) provides a much weaker topology, which informally means that it makes it easier for a sequence of distribution to converge as compared to other ''f''-divergences. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. As a result, stronger notions of distances such as KL-divergence, often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show there is a nice relationship between the magnitude of the Wasserstein distance and the distance between distributions; a smaller distance nicely corresponds to a smaller distance between the two distributions, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, i.e. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, i.e. <math>\small X</math>, are used for random variables and lower case letters, i.e. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, i.e. <math>\small P(X)</math>, and corresponding densities with lower case letters, i.e. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize OT <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in \pmb{\mathbb{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in \pmb{\mathbb{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is a set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math>, the following Kantorovich-Rubinstein duality holds for the <math>\small 1^{st}</math> root of <math>\small W_c</math>:<br />
\begin{align}<br />
\small W_1(P_X, P_G) := \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions]. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variables <math>\small P_G </math> defined by a two step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q(Z) </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \in P_X </math> and <math>\small Z \in Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_z=P_z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraints on <math>\small Q(Z|X) </math> and <math>\small P_Z </math> are relaxed and a penalty function is added to the objective leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in Q}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small Q </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD). The authors note that a numerical solution to the dual formulation of the problem has been tried by clipping the weights of the network (to satisfy the Lipschitz condition) and by penalizing the objective with <math>\small \lambda \mathbb{E}(\parallel \nabla f(X) \parallel - 1)^2 </math><br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence metric, and use adversarial training to estimate it. Specifically a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> trying to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathcal{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. This can be used as a divergence measure and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized encoders only minimized the reconstruction cost, and resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower-bound term comprising of a reconstruction cost and a KL-divergence measure which captures how distinct each training example is from the prior <math>\small P_Z</math>. This however does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. This is ensured by WAE however, is a direct consequence of our objective function derived from Theorem 1, and is visually represented in the figure below. It is also interesting to note that this also allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial autoencoders (AAE). Thus the theory suggests that AAE minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus it cannot be applied to another other cost <math>\small W_c</math> as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1, is relaxed in line with theory on unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
Gaussian prior distribution <math> \small P_Z</math> and squared cost function <math> \small c(x,y)</math> are used for data-points. The encoder-decoder pairs were deterministic. The convolutional deep neural network for encoder mapping and decoder mapping are similar to DC-GAN with batch normalization. Real world datasets, MNIST with 70k images and CelebA with 203k images were used for training and testing. For interpolations a pair of of held out images, <math>(x,y)</math> from the test set are Auto-encoded (separately), to produce <math>(z_x, z_y)</math> in the latent space. The elements of the latent space are linearly interpolated and decoded to produce the images below. <br />
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> composed of several fully connected layers with ReLu activations. For WAE-MMD, the RBF kernel failed to penalize outliers and thus the authors resorted to using inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel{x-y}_2^2\parallel) </math>. Trained models are presented in the figure below.<br />
As far as random sampled results are concerned, WAE-GAN seems to be highly unstable but do lead to better matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD on the other hand has much more stable training and fairly good quality of sampled results.<br />
<br />
'''Qualitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, they use the Fréchet Inception Distance and report the results on CelebA (The Fréchet Inception Distance measures the similarity between two sets of images, by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples. Let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the Fréchet Inception Distance between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017). ) These results confirm that the sampled images from WAE are of better quality than from VAE (score: 82), and WAE-GAN gets a slightly better score (score:42) than WAE-MMD (score:55), which correlates with visual inspection of the images.<br />
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|center|Qualitative Assessment of Images]]<br />
<br />
'''Network structures:'''<br />
<br />
The Encoder, Decoder, and Adversary architectures used for the MNIST and CelebA datasets are as sown in the following two images:<br />
<br />
[[File:WAE_MNIST.png|700px|thumb|center|Network architectures used to evaluate on the MNIST dataset.]]<br />
<br />
[[File:WAE_CelebA.png|700px|thumb|center|Network architectures used to evaluate on the CelebA dataset.]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=36230Wasserstein Auto-Encoders2018-04-07T00:16:06Z<p>Swalsh: added network architectures</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. There has been recent research in generating encoder-decoder GANs where an encoder is trained in parallel with the generator, based on the intuition that this will allow the GAN to learn meaningful mapping from the compressed representation to the original image; however, these models also suffer from mode-collapse and perform comparable to vanilla GANs. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, encoder-decoder structure and nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
When dealing with the continuous probability domain, the EM distance or the minimum one among the costs of all dirt moving solutions becomes:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\sim\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
Where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because it's marginal structure gives some intuition that it represents the amount of probability mass to be moved from x to y. This intuition can be explained by looking at the following equation.<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
<br />
The Wasserstein distance or the cost of Optimal Transport (OT) provides a much weaker topology, which informally means that it makes it easier for a sequence of distribution to converge as compared to other ''f''-divergences. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. As a result, stronger notions of distances such as KL-divergence, often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show there is a nice relationship between the magnitude of the Wasserstein distance and the distance between distributions; a smaller distance nicely corresponds to a smaller distance between the two distributions, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, i.e. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, i.e. <math>\small X</math>, are used for random variables and lower case letters, i.e. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, i.e. <math>\small P(X)</math>, and corresponding densities with lower case letters, i.e. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize OT <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in \pmb{\mathbb{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in \pmb{\mathbb{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is a set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math>, the following Kantorovich-Rubinstein duality holds for the <math>\small 1^{st}</math> root of <math>\small W_c</math>:<br />
\begin{align}<br />
\small W_1(P_X, P_G) := \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions]. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variables <math>\small P_G </math> defined by a two step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q(Z) </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \in P_X </math> and <math>\small Z \in Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_z=P_z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraints on <math>\small Q(Z|X) </math> and <math>\small P_Z </math> are relaxed and a penalty function is added to the objective leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in Q}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small Q </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD). The authors note that a numerical solution to the dual formulation of the problem has been tried by clipping the weights of the network (to satisfy the Lipschitz condition) and by penalizing the objective with <math>\small \lambda \mathbb{E}(\parallel \nabla f(X) \parallel - 1)^2 </math><br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence metric, and use adversarial training to estimate it. Specifically a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> trying to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathcal{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. This can be used as a divergence measure and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized encoders only minimized the reconstruction cost, and resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower-bound term comprising of a reconstruction cost and a KL-divergence measure which captures how distinct each training example is from the prior <math>\small P_Z</math>. This however does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. This is ensured by WAE however, is a direct consequence of our objective function derived from Theorem 1, and is visually represented in the figure below. It is also interesting to note that this also allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial autoencoders (AAE). Thus the theory suggests that AAE minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus it cannot be applied to another other cost <math>\small W_c</math> as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1, is relaxed in line with theory on unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
Gaussian prior distribution <math> \small P_Z</math> and squared cost function <math> \small c(x,y)</math> are used for data-points. The encoder-decoder pairs were deterministic. The convolutional deep neural network for encoder mapping and decoder mapping are similar to DC-GAN with batch normalization. Real world datasets, MNIST with 70k images and CelebA with 203k images were used for training and testing. For interpolations a pair of of held out images, <math>(x,y)</math> from the test set are Auto-encoded (separately), to produce <math>(z_x, z_y)</math> in the latent space. The elements of the latent space are linearly interpolated and decoded to produce the images below. <br />
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> composed of several fully connected layers with ReLu activations. For WAE-MMD, the RBF kernel failed to penalize outliers and thus the authors resorted to using inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel{x-y}_2^2\parallel) </math>. Trained models are presented in the figure below.<br />
As far as random sampled results are concerned, WAE-GAN seems to be highly unstable but do lead to better matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD on the other hand has much more stable training and fairly good quality of sampled results.<br />
<br />
'''Qualitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, they use the Fréchet Inception Distance and report the results on CelebA (The Fréchet Inception Distance measures the similarity between two sets of images, by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples. Let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the Fréchet Inception Distance between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017). ) These results confirm that the sampled images from WAE are of better quality than from VAE (score: 82), and WAE-GAN gets a slightly better score (score:42) than WAE-MMD (score:55), which correlates with visual inspection of the images.<br />
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|center|Qualitative Assessment of Images]]<br />
<br />
'''Network structures:'''<br />
<br />
The Encoder, Decoder, and Adversary architectures used for the MNIST and CelebA datasets are as sown in the following two images:<br />
<br />
[[File:WAE_MNIST.png|900px|thumb|center|Network architectures used to evaluate on the MNIST dataset.]]<br />
<br />
[[File:WAE_CelebA.png|900px|thumb|center|Network architectures used to evaluate on the CelebA dataset.]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:WAE_CelebA.png&diff=36229File:WAE CelebA.png2018-04-07T00:13:28Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:WAE_MNIST.png&diff=36228File:WAE MNIST.png2018-04-07T00:13:06Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation&diff=36225stat946w18/MaskRNN: Instance Level Video Object Segmentation2018-04-06T23:56:56Z<p>Swalsh: typo and formatting of introduction</p>
<hr />
<div>== Introduction ==<br />
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classify the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object Segmentation task involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation task attempts at segmenting the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects, are just some of the significant challenges in instance level video object segmentation.<br />
<br />
There are 2 different types of video object segmentation, Unsupervised and Semi-supervised. <br />
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. <br />
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. <br />
<br />
In this paper, the authors look at an unsupervised video object segmentation technique.<br />
<br />
== Background Papers ==<br />
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter and the intra-frame relationship of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following is a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.<br />
<br />
=== OSVOS (One-Shot Video Object Segmentation) ===<br />
<br />
[[File:OSVOS.jpg | 1000px]]<br />
<br />
This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from image classification task. This network is then converted into a fully-connected network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.<br />
<br />
During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.<br />
<br />
=== MaskTrack (Learning Video Object Segmentation from Static Images) ===<br />
<br />
[[File:MaskTrack.jpg | 500px]]<br />
<br />
MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.<br />
<br />
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.<br />
<br />
A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.<br />
<br />
Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.<br />
<br />
[[File:Paper19-SegmentationComp.png]]<br />
<br />
== Dataset ==<br />
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.<br />
<br />
== MaskRNN: Introduction ==<br />
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation on each frame and using additional information (mask propagation and optical flow) from the preceding frame perform predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.<br />
<br />
== MaskRNN: Overview ==<br />
In a video sequence <math>I = \{I_1, I_2, …, I_T\}</math>, the sequence of <math>T</math> frames are given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for <math>N</math> salient objects.<br />
In this paper, the problem is formulated as a time dependency problem and using a recurrent neural network, the prediction of the previous frame influences the prediction of the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is a made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.<br />
<br />
== MaskRNN: Multiple Instance Level Segmentation ==<br />
<br />
[[File:2ObjectSeg.jpg | 850px]]<br />
<br />
Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN] this approach is converted into a multiple binary segmentation problem. A separate segmentation mask is predicted separately for each salient object and thus we get a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object containing the largest predicted probability.<br />
<br />
=== MaskRNN: Binary Segmentation Network ===<br />
<br />
[[File:MaskRNNDeepNet.jpg | 850px]]<br />
<br />
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The network consists of 2 networks: binary segmentation network and object localization network. The binary segmentation network is split into two streams: appearance and flow stream. The input of the appearance stream is the RGB frame at time t and the wrapped prediction of the binary segmentation mask from time <math>t-1</math>. The wrapping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the wrapped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RBG format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size of the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The output of the flow stream and the appearance stream is linearly combined and sigmoid function is applied to the result to generate binary mask for ith object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.<br />
<br />
=== MaskRNN: Object Localization Network: ===<br />
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.<br />
<br />
=== Training and Finetuning ===<br />
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.<br />
<br />
== MaskRNN: Implementation Details ==<br />
=== Offline Training ===<br />
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation of both datasets. After both the object localization net and binary segmentation networks have trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN is only able to backpropagate the gradients back 7 frames and learn long-term temporal information. <br />
<br />
For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts the two images between which the optical flow is to be computed as input, as applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature-maps. Besides the output of the previous upconvolution, each upconvolution is also fed as input the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow. ) <br />
<br />
=== Online Finetuning ===<br />
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping is applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.<br />
<br />
== MaskRNN: Experimental Results ==<br />
=== Evaluation Metrics ===<br />
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:<br />
<br />
1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.<br />
<br />
\begin{equation}<br />
IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask, by calculating the the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects. <br />
<br />
[[File:Fscore.jpg | 200px|center]]<br />
<br />
3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.<br />
<br />
Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.<br />
<br />
=== Ablation Study ===<br />
<br />
The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.<br />
<br />
[[File:MaskRNNTable2.jpg | 700px|center]]<br />
<br />
The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net.<br />
<br />
=== Quantitative Evaluation ===<br />
<br />
The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.<br />
<br />
[[File:MaskRNNTable3.jpg | 700px]]<br />
<br />
The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain is significant by employing a Recurrent Neural Network for learning recurrence relationship and using a object localization network to improve prediction results.<br />
<br />
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.<br />
<br />
[[File:MaskRNNTable4.jpg | 700px]]<br />
<br />
=== Qualitative Evaluation ===<br />
The authors showed example qualitative results from the DAVIS and Segtrack datasets. <br />
<br />
Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.<br />
<br />
[[File:maskrnn_example.png | 700px]]<br />
<br />
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).<br />
<br />
[[File:maskrnn_example_fail.png | 700px]]<br />
<br />
== Conclusion ==<br />
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Using online fine-tuning the network is adjusted to predict better for the current video sequence.<br />
<br />
== Critique ==<br />
The paper provides a technique to track multiple objects in a video. The novelty is to add back-propagation through time to improve the tracking accuracy and using a localization network to remove any outliers in the segmented binary mask. However, the network architecture it too large and isn't able to run in real-time. There are N deep-Nets for N objects and each deep-Net contains 2 parallel VGG-16 convolutional networks.<br />
<br />
== Implementation ==<br />
<br />
The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.<br />
<br />
== References ==<br />
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation". Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos<br />
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.<br />
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.<br />
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.<br />
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.<br />
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=36222Wavelet Pooling CNN2018-04-06T23:36:34Z<p>Swalsh: fixed backpropogation description</p>
<hr />
<div>== Introduction ==<br />
Convolutional neural networks (CNN) have been proven to be powerful in image classification. Over the past few years, researchers have put efforts in improving fundamental components of CNNs such as the pooling operation. Various pooling methods exist; deterministic methods include max pooling and average pooling and probabilistic methods include mixed pooling and stochastic pooling. All these methods employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation uses a sub-band method that the authors' claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data is reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. Max pooling and Mean/Average pooling are the 2 most commonly used pooling methods. For max pooling, this can be represented by the equation <math>a_{kij} = max_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^th</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
To account for the above-mentioned issues, probabilistic pooling methods were introduced, namely mixed pooling and; stochastic pooling. Mixed pooling is a simple method which just combines the max and the average pooling by randomly selecting one method over the other during training. Mixed pooling can be applied for all features, mixed between features, or mixed between regions for different features. Stochastic pooling on the other hand randomly samples within a receptive field with the activation values as the probabilities. These are calculated by taking each activation value and dividing it by the sum of all activation values in the grid so that the probabilities sum to 1.<br />
<br />
Figure 3 shows an example of how stochastic pooling works. On the left is a 3x3 grid filled with activations. The middle grid is the corresponding probability for each activation. The activation in the middle was randomly selected (it had a 13% chance of getting selected). Because the stochastic pooling is based on the probability of the pixels, it is able to avoid the shortcomings of max and mean pooling mentioned above.<br />
<br />
[[File:paper21-stochasticpooling.png|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast-changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. The wavelet transform is often compared with the Fourier transform, in which signals are represented as a sum of sinusoids. In fact, the Fourier transform can be viewed as a special case of the continuous wavelet transform with the choice of the mother wavelet <math>\psi(t)=e^{-2 \pi it}</math>. The main difference in general is that wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in frequency. The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well. <br />
<br />
Essentially, a wavelet is a fast decaying oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well-defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transforms for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per-row transform is taken first. This results in a new image where the first half is a low-frequency sub-band and the second half is the high-frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low-frequency content approximates the image and the high-frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
In left half of the above image we see a grid containing four different transformations of the same initial image. Each transform has been done by applying a row wise convolution with a wavelet of either high or low frequency, then a column wise convolution with another wavelet of either high or low frequency. The four choices of frequency (LL, LH, HL, HH) result in four different images. The top left image is the result of applying a low frequency wavelet convolution to the original image both row wise and column wise. The bottom left image is the result of first applying a high frequency wavelet convolution row wise and then applying a low frequency wavelet convolution column wise. Since the LL (top right) transformation preserves the original image best, it is then used in this process again to generate the grid of smaller images that can be seen in the top center-right of the above image. The images in this smaller grid are called second order coefficients.<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<br />
\begin{align}<br />
\begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix}<br />
\end{align}<br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* For each row i = <math>[i_{1}, i_{2}, i_{3}, i_{4}]</math> of the input image, transform the row to <math>i_{t}</math> via<br />
<br />
\begin{align}<br />
i_{t} = [(i_{1} + i_{2}) / 2, (i_{3} + i_{4}) / 2, (i_{1}, - i_{2}) / 2, (i_{3} - i_{4}) / 2]<br />
\end{align}<br />
<br />
After the row transforms, the images looks as follows:<br />
\begin{align}<br />
\begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix}<br />
\end{align}<br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create less artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k <= 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\psi[j,n]|_{n = 2k, k <= 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math>, <math>W_\psi</math>, are approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are time reversed scaling and wavelet vectors, <math>(n)</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply to images, FWT is first applied on the rows and then the columns. If a low (L) and high(H) sub-band is extracted from the rows and similarly for the columns than at each level there is 4 sub-bands (LH, HL, HH, and LL) where LL will further be decomposed into the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k <= 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The image feature first has to be converted into the sub-bands using the 1st order wavelet decomposition. The sub-bands are upsampled by a factor of 2 and then backpropagated through the IDWT to achieve the final image. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SHVN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used and the Haar wavelet is used due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window, and a consistent pooling method was used for all pooling layers of a network. The overall results teach us that the pooling method should be chosen specifically for the type of data we have. In some cases, wavelet pooling may perform the best, and in other cases, other methods may perform better, if the data is more suited for those types of pooling.<br />
<br />
=== MNIST ===<br />
Figure 7 shows the network how's architecture was based on an example of MNIST structure from MatConvNet, with batch normalization. Table 1 shows the algorithms accuracy. It can be seen that wavelet pooling achieves the best accuracy from all pooling methods compared. Figure 8 shows the energy of each method per epoch. As can be noted by Figure 8 average and wavelet pooling show a smoother descent in learning and error reduction.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:paper21_fig8.png|800px|center]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
In order to investigate the performance of different pooling methods, two types of networks are trained based on CIFAR-10. The first one is the regular CNN and the second one is the network with dropout and batch normalization. Figure 9 shows the network and Tables 2 and 3 shows the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive, while max pooling overfitted on the validation data fairly quickly as shown by the right energy curve in Figure 10 (although the accuracy performance is not significantly worse when dropout and batch normalization are applied).<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:paper21_fig10.png|800px|center]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
===SHVN===<br />
Figure 11 shows the network and Tables 4 and 5 shows the accuracy without and with dropout. The proposed method does not perform well in this experiment. <br />
<br />
[[File: a.png|650px|center|]]<br />
<br />
[[File:paper21_fig12.png|800px|center]]<br />
<br />
[[File: b.png|650px|center|]]<br />
<br />
===KDEF===<br />
The authors experimented with pooling methods + dropout on the KDEF dataset (which consists of 4,900 images of 35 people portraying varying emotions through facial expressions under different poses, 3,900 of which were randomly assigned to be used for training). The data was treated for errors (e.g. corrupt images) and resized to 128x128 for memory and time constraints. <br />
<br />
Figure 13 below shows the network structure. Figure 14 shows the energy curve of the competing models on training and validation sets as the number of epochs increases, and Table 6 shows the accuracy performance. Average pooling demonstrated the highest accuracy, with wavelet pooling coming in second and max-pooling a close third. However, stochastic and wavelet pooling exhibited more stable learning progression compared to the other methods, and max-pooling eventually overfitted. <br />
<br />
[[File:kdef_struc.PNG|700px|center|]]<br />
[[File:kdef_curve.PNG|750px|center|]]<br />
[[File:kdef_accu.PNG|550px|center|]]<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet Pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to a comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally extensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (HAAR wavelet)<br />
<br />
== References ==<br />
* Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.<br />
* J. Anthony Parker, Robert V. Kenyon, and Donald E. Troxel. Comparison of interpolating methods for image resampling. IEEE Transactions on Medical Imaging, 2(1):31–39, 1983.<br />
* Matthew Zeiler and Robert Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proceedings of the International Conference on Learning Representation (ICLR), 2013.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT&diff=36219stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT2018-04-06T23:13:56Z<p>Swalsh: examining the advantage of adversarially learned cost fcn</p>
<hr />
<div>== Introduction ==<br />
Recently, the problem of how to learn models that generate media such as images, video, audio and text has become very popular and is called Generative Modeling. One of the main benefits of such an approach is that generative models can be trained on unlabeled data that is readily available . Therefore, generative networks have a huge potential in the field of deep learning.<br />
<br />
Generative Adversarial Networks (GANs) are powerful generative models used for unsupervised learning techniques where the 2 agents compete to generate a zero-sum model. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data having a distribution matched with the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distribution distance between the generated data and the real one is important to train the generator.<br />
<br />
Optimal transport theory, which is another approach to measuring distances between distributions, evaluates the distribution distance between the generated data and the training data based on a metric, which provides another method for generator training. The main advantage of optimal transport theory over the distance measurement in GAN is its closed form solution for having a tractable training process. But the theory might also result in inconsistency in statistical estimation due to the given biased gradients if the mini-batches method is applied (Bellemare et al.,<br />
2017).<br />
<br />
This paper presents a variant GANs named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.<br />
<br />
== GANs and Optimal Transport ==<br />
<br />
===Generative Adversarial Nets===<br />
Original GAN was firstly reviewed. The objective function of the GAN: <br />
<br />
[[File:equation1.png|700px]]<br />
<br />
The goal of GANs is to train the generator g and the discriminator d finding a pair of (g,d) to achieve Nash equilibrium(such that either of them cannot reduce their cost without changing the others' parameters). However, it could cause failure of converging since the generator and the discriminator are trained based on gradient descent techniques.<br />
<br />
===Wasserstein Distance (Earth-Mover Distance)===<br />
<br />
In order to solve the problem of convergence failure, Arjovsky et. al. (2017) suggested Wasserstein distance (Earth-Mover distance) based on the optimal transport theory.<br />
<br />
[[File:equation2.png|600px]]<br />
<br />
where <math> \prod (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data), <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function and the Euclidean distance was used by Arjovsky et. al. in the paper. <br />
<br />
The Wasserstein distance can be considered as moving the minimum amount of points between distribution <math> g(y) </math> and <math> p(x) </math> such that the generator distribution <math> g(y) </math> is similar to the real data distribution <math> p(x) </math>.<br />
<br />
Computing the Wasserstein distance is intractable. The proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into Kantorovich-Rubinstein dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.<br />
<br />
[[File:equation3.png|600px]]<br />
<br />
W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.<br />
<br />
===Sinkhorn Distance===<br />
Genevay et al. (2017) proposed to use the primal formulation of optimal transport instead of the dual formulation to generative modeling. They introduced Sinkhorn distance which is a smoothed generalization of the Wasserstein distance.<br />
[[File: equation4.png|600px]]<br />
<br />
It introduced entropy restriction (<math> \beta </math>) to the joint distribution <math> \prod_{\beta} (p,g) </math>. This distance could be generalized to approximate the mini-batches of data <math> X ,Y</math> with <math> K </math> vectors of <math> x, y</math>. The <math> i, j </math> th entry of the cost matrix <math> C </math> can be interpreted as the cost it needs to transport the <math> x_i </math> in mini-batch X to the <math> y_i </math> in mini-batch <math>Y </math>. The resulting distance will be:<br />
<br />
[[File: equation5.png|550px]]<br />
<br />
where <math> M </math> is a <math> K \times K </math> matrix, each row of <math> M </math> is a joint distribution of <math> \gamma (x,y) </math> with positive entries. The summmation of rows or columns of <math> M </math> is always equal to 1. <br />
<br />
This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distribution when taking the expectation of <math> \mathcal{W}_{c} </math> and the gradients are biased when the mini-batch size is fixed.<br />
<br />
===Energy Distance (Cramer Distance)===<br />
In order to solve the above problem, Bellemare et al. proposed Energy distance:<br />
<br />
[[File: equation6.png|700px]]<br />
<br />
where <math> x, x' </math> and <math> y, y'</math> are independent samples from data distribution <math> p </math> and generator distribution <math> g </math>, respectively. Based on the Energy distance, Cramer GAN is to minimize the ED distance metric when training the generator.<br />
<br />
==Mini-Batch Energy Distance==<br />
Salimans et al. (2016) mentioned that comparing to use distributions over individual images, mini-batch GAN is more powerful when using the distributions over mini-batches <math> g(X), p(X) </math>. The distance measure is generated for mini-batches.<br />
<br />
===Generalized Energy Distance===<br />
The generalized energy distance allowed to use non-Euclidean distance functions d. It is also valid for mini-batches and is considered better than working with individual data batch.<br />
<br />
[[File: equation7.png|670px]]<br />
<br />
Similarly as defined in the Energy distance, <math> X, X' </math> and <math> Y, Y'</math> can be the independent samples from data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. While in Generalized engergy distance, <math> X, X' </math> and <math> Y, Y'</math> can also be valid for mini-batches. The <math> D_{GED}(p,g) </math> is a metric when having <math> d </math> as a metric. Thus, taking the triangle inequality of <math> d </math> into account, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.<br />
<br />
===Mini-Batch Energy Distance===<br />
As <math> d </math> is free to choose, authors proposed Mini-batch Energy Distance by using entropy-regularized Wasserstein distance as <math> d </math>. <br />
<br />
[[File: equation8.png|650px]]<br />
<br />
where <math> X, X' </math> and <math> Y, Y'</math> are independent sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with primal form of optimal transport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between each mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> to equation (5) and using energy distance, the objective becomes statistically consistent (meaning the objective converges to the true parameter value for large sample sizes) and mini-batch gradients are unbiased.<br />
<br />
==Optimal Transport GAN (OT-GAN)==<br />
<br />
The mini-batch energy distance which was proposed depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose c to be some fixed function over vectors, like Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like Euclidean distance, there exists many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is zero such that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative. <br />
<br />
In practice, in order to secure the statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), authors suggested using cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math> based on the deep neural network that maps the mini-batch data to a learned latent space. Here is the transportation cost:<br />
<br />
[[File: euqation9.png|370px]]<br />
<br />
where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.<br />
<br />
Unlike the practice when using the original GANs, the generator was trained more often than the critic, which keep the cost function from degeneration. The resulting generator in OT-GAN has a well defined and statistically consistent objective through the training process.<br />
<br />
The algorithm is defined below. The backpropagation is not used in the algorithm since ignoring this gradient flow is justified by the envelope theorem (i.e. when changing the parameters of the objective function, changes in the optimizer do not contribute to a change in the objective function). Stochastic gradient descent is used as the optimization method in algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in experiments. <br />
<br />
[[File: al.png|600px]]<br />
<br />
<br />
[[File: al_figure.png|600px]]<br />
<br />
==Experiments==<br />
<br />
In order to demonstrate the supermum performance of the OT-GAN, authors compared it with the original GAN and other popular models based on four experiments: Dataset recovery; CIFAR-10 test; ImageNet test; and the conditional image synthesis test.<br />
<br />
===Mixture of Gaussian Dataset===<br />
OT-GAN has a statistically consistent objective when it is compared with the original GAN (DC-GAN), such that the generator would not update to a wrong direction even if the signal provided by the cost function to the generator is not good. In order to prove this advantage, authors compared the OT-GAN with the original GAN loss (DAN-S) based on a simple task. The task was set to recover all of the 8 modes from 8 Gaussian mixers in which the means were arranged in a circle. MLP with RLU activation functions were used in this task. The critic was only updated for 15K iterations. The generator distribution was tracked for another 25K iteration. The results showed that the original GAN experiences the model collapse after fixing the discriminator while the OT-GAN recovered all the 8 modes from the mixed Gaussian data.<br />
<br />
[[File: 5_1.png|600px]]<br />
<br />
===CIFAR-10===<br />
<br />
The dataset CIFAR-10 was then used for inspecting the effect of batch-size to the model training process and the image quality. OT-GAN and four other methods were compared using "inception score" as the criteria for comparison. Figure 3 shows the change of inceptions scores (y-axis) by the increased of the iteration number. Scores of four different batch sizes (200, 800, 3200 and 8000) were compared. The results show that a larger batch size, which would more likely cover more modes in the distribution of data, lead to a more stable model showing a larger value in inception score. However, a large batch size would also require a high-performance computational environment. The sample quality across all 5 methods, ran using a batch size of 8000, are compared in Table 1 where the OT_GAN has the best score.<br />
<br />
The OT-GAN was trained using Adam optimizer. The learning rate was set to <math> 3 x10^-4, β_1 = 0.5, β_2 = 0.999 </math>. The introduced OT-GAN algorithm also includes two additional hyperparameters for the Sinkhorn algorithm. The first hyperparameters indicated the number of iterations to run the algorithm and the second <math> 1/λ </math> the entropy penalty of alignments. The authors found that a value of 500 worked well for both mentioned hyperparameters.<br />
<br />
[[File: 5_2.png|600px]]<br />
<br />
Figure 4 below shows samples generated by the OT-GAN trained with a batch size of 8000. Figure 5 below shows random samples from a model trained with the same architecture and hyperparameters, but with random matching of samples in place of optimal transport.<br />
<br />
[[File: ot_gan_cifar_10_samples.png|600px]]<br />
<br />
<br />
In order to show the advantage of learning the cost function adversarially, the CIFAR-10 experiment was re-run with the cost as follows:<br />
<br />
[[File: OTGAN_CosineDist.png|250px]]<br />
<br />
When using this fixed cost and keeping the other experiment settings constant, the max inception score dropped from 8.47 with learned to 4.93 with fixed cost functions. The results of the fixed cost are seen in Figure 8 below.<br />
<br />
[[File: OTGAN_fixedDist.png|600px]]<br />
<br />
===ImageNet Dogs===<br />
<br />
In order to investigate the performance of OT-GAN when dealing with the high-quality images, the dog subset of ImageNet (128*128) was used to train the model. Figure 6 shows that OT-GAN produces less nonsensical images and it has a higher inception score compare to the DC-GAN. <br />
<br />
[[File: 5_3.png|600px]]<br />
<br />
<br />
To analyze mode collapse in GANs the authors trained both types of GANs for a large number of epochs. They find the DCGAN shows mode collapse as soon as 900 epochs. They trained the OT-GAN for 13000 epochs and saw no evidence of mode collapse or less diversity in the samples. Samples can be viewed in Figure 9.<br />
<br />
[[File: ModelCollapseImageNetDogs.png|600px]]<br />
<br />
===Conditional Generation of Birds===<br />
<br />
The last experiment was to compare OT-GAN with three popular GAN models for processing the text-to-image generation demonstrating the performance on conditional image synthesis. As can be found from Table 2, OT-GAN received the highest inception score than the scores of the other three models. <br />
<br />
[[File: 5_4.png|600px]]<br />
<br />
The algorithm used to obtain the results above is conditional generation generalized from '''Algorithm 1''' to include conditional information <math>s</math> such as some text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.<br />
<br />
[[File: paper23_alg2.png|600px]]<br />
<br />
==Conclusion==<br />
<br />
In this paper, an OT-GAN method was proposed based on the optimal transport theory. A distance metric that combines the primal form of the optimal transport and the energy distance was given was presented for realizing the OT-GAN. The results showed OT-GAN to be uniquely stable when trained with large mini batches and state of the art results were achieved on some datasets. One of the advantages of OT-GAN over other GAN models is that OT-GAN can stay on the correct track with an unbiased gradient even if the training on critic is stopped or presents a weak cost signal. The performance of the OT-GAN can be maintained when the batch size is increasing, though the computational cost has to be taken into consideration.<br />
<br />
==Critique==<br />
<br />
The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. The stability was demonstrated through the four experiments that comparing OP-GAN with other popular methods. However, limitations in computational efficiency were not discussed much. Furthermore, in section 2, the paper lacks explanation on using mini-batches instead of a vector as input when applying Sinkhorn distance. It is also confusing when explaining the algorithm in section 4 about choosing M for minimizing <math> \mathcal{W}_c </math>. Lastly, it is found that it is lack of parallel comparison with existing GAN variants in this paper. Readers may feel jumping from one algorithm to another without necessary explanations.<br />
<br />
==Reference==<br />
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:OTGAN_fixedDist.png&diff=36218File:OTGAN fixedDist.png2018-04-06T23:08:18Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:OTGAN_CosineDist.png&diff=36217File:OTGAN CosineDist.png2018-04-06T23:08:03Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=36216Do Deep Neural Networks Suffer from Crowding2018-04-06T22:46:13Z<p>Swalsh: added dataset histogram to show why diff datasets have more crowding</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
Another common example to visualize the same:<br />
[[File:crowding-tigger.jpg|center|600px]]<br />
<br />
<br />
===Drawbacks of CNNs===<br />
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance and so obviously a lot of processing is needed for CNNs.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and is explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.<br />
<br />
= Models =<br />
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.<br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
# '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf, Poggio et al] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes. Intuitively, this means that the network generalizes learnings across scales and is guaranteed by during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments to the convolutional units across different scale channels is applied.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, not MNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object, see Figure 1 above for an example (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The results are reported by different flanker types <math>(xax,ax, xa)</math> at test. <br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before. The constant spacing and constant eccentricity effect have been evaluated.<br />
<br />
[[File:result2.png|750x400px|center]]<br />
<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target.<br />
<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits, and the authors note that this is because they are visually similar to MNIST digits, but are not actually digits, and thus activate the model's convolutional filters less than MNIST digits. The notMNIST digits however, result it more crowding. This is due to the fact that the different font style results in more high intensity pixels and edges. The intensity distributions for the 3 datasets is shown in the histograms in Fig. 12 below. The correlation between crowding and relative frequency of high intensity pixels can be seen from this figure.<br />
<br />
[[File:crowding_at_end_pooling.png|750px|center]]<br />
<br />
[[File:DCNN dataset histogram.png|750px|center]]<br />
<br />
<br />
===DCNN Observations===<br />
* Accuracy decreases with the increase in the number of flankers.<br />
* Unsurprisingly, CNNs are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps the network in learning invariance.<br />
* Flankers similar to the target object helps in recognition since they activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier. The spacial pooling keeps constant. The effect of pooling across scales are investigated. The three configurations for scale pooling are (i) at the beginning, (ii)progressively and (iii) at the end. <br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* The recognition accuracy is dependent on the eccentricity of the target object.<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
Without contrast normalization, the middle portion of the image can be focused more with high resolution so the target at the center with no normalization performs well in that case. But if normalization is done, then all the segments of the image contribute to the classification and hence the overall accuracy is not that great but the system becomes robust to the changes in eccentricity.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.<br />
<br />
=Conclusions=<br />
This paper investigates the effect of crowding on a DNN. Using a simple technique of adding clutter in the model didn't improve the performance. We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects. The following 4 techniques influenced crowding in DNN:<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex. <br />
* If target locations are proposed, then the system can become even more robust and hence a simple network can become robust to clutter while also reducing the amount of training data and time needed<br />
<br />
=Critique=<br />
This paper only tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such type of crowding. The paper only shows that the eccentricity based model does better (than plain DCNN model) when the target is placed at the center of the image but maybe windowing over the frames the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
# Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
# J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DCNN_dataset_histogram.png&diff=36215File:DCNN dataset histogram.png2018-04-06T22:44:31Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dynamic_Routing_Between_Capsules_STAT946&diff=36212Dynamic Routing Between Capsules STAT9462018-04-06T22:02:59Z<p>Swalsh: added reasoning for selecting the 3 iteration method</p>
<hr />
<div>= Presented by =<br />
<br />
Yang, Tong(Richard)<br />
<br />
= Contributions =<br />
<br />
This paper introduces the concept of "capsules" and an approach to implement this concept in neural networks. Capsules are groups of neurons used to represent various properties of an entity/object present in the image, such as pose, deformation, and even the existence of the entity. Instead of the obvious representation of a logistic unit for the probability of existence, the paper explores using the length of the capsule output vector to represent existence, and the orientation to represent other properties of the entity. The paper makes the following major contributions:<br />
<br />
* Proposes an alternative to max-pooling called routing-by-agreement.<br />
* Demonstrates a mathematical structure for capsule layers and a routing mechanism. Builds a prototype architecture for capsule networks. <br />
* Presented promising results that confirm the value of Capsnet as a new direction for development in deep learning.<br />
<br />
= Hinton's Critiques on CNN =<br />
<br />
In a past talk, Hinton tried to explain why max-pooling is the biggest problem with current convolutional networks. Here are some highlights from his talk. <br />
<br />
== Four arguments against pooling ==<br />
<br />
* It is a bad fit to the psychology of shape perception: It does not explain why we assign intrinsic coordinate frames to objects and why they have such huge effects.<br />
<br />
* It solves the wrong problem: We want equivariance, not invariance. Disentangling rather than discarding.<br />
<br />
* It fails to use the underlying linear structure: It does not make use of the natural linear manifold that perfectly handles the largest source of variance in images.<br />
<br />
* Pooling is a poor way to do dynamic routing: We need to route each part of the input to the neurons that know how to deal with it. Finding the best routing is equivalent to parsing the image.<br />
<br />
===Intuition Behind Capsules ===<br />
We try to achieve viewpoint invariance in the activities of neurons by doing max-pooling. Invariance here means that by changing the input a little, the output still stays the same while the activity is just the output signal of a neuron. In other words, when in the input image we shift the object that we want to detect by a little bit, networks activities (outputs of neurons) will not change because of max pooling and the network will still detect the object. But the spacial relationships are not taken care of in this approach so instead capsules are used, because they encapsulate all important information about the state of the features they are detecting in a form of a vector. Capsules encode probability of detection of a feature as the length of their output vector. And the state of the detected feature is encoded as the direction in which that vector points to. So when detected feature moves around the image or its state somehow changes, the probability still stays the same (length of vector does not change), but its orientation changes.<br />
<br />
== Equivariance ==<br />
<br />
To deal with the invariance problem of CNN, Hinton proposes the concept called equivariance, which is the foundation of capsule concept.<br />
<br />
=== Two types of equivariance ===<br />
<br />
==== Place-coded equivariance ====<br />
If a low-level part moves to a very different position it will be represented by a different capsule.<br />
<br />
==== Rate-coded equivariance ====<br />
If a part only moves a small distance it will be represented by the same capsule but the pose outputs of the capsule will change.<br />
<br />
Higher-level capsules have bigger domains so low-level place-coded equivariance gets converted into high-level rate-coded equivariance.<br />
<br />
= Dynamic Routing =<br />
<br />
In the second section of this paper, authors give mathematical representations for two key features in routing algorithm in capsule network, which are squashing and agreement. The general setting for this algorithm is between two arbitrary capsules i and j. Capsule j is assumed to be an arbitrary capsule from the first layer of capsules, and capsule i is an arbitrary capsule from the layer below. The purpose of routing algorithm is to generate a vector output for routing decision between capsule j and capsule i. Furthermore, this vector output will be used in the decision for choice of dynamic routing. <br />
<br />
== Routing Algorithm ==<br />
<br />
The routing algorithm is as the following:<br />
<br />
[[File:DRBC_Figure_1.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
In the following sections, each part of this algorithm will be explained in details.<br />
<br />
=== Log Prior Probability ===<br />
<br />
<math>b_{ij}</math> represents the log prior probabilities that capsule i should be coupled to capsule j, and updated in each routing iteration. As line 2 suggests, the initial values of <math>b_{ij}</math> for all possible pairs of capsules are set to 0. In the very first routing iteration, <math>b_{ij}</math> equals to zero. For each routing iteration, <math>b_{ij}</math> gets updated by the value of agreement, which will be explained later.<br />
<br />
=== Coupling Coefficient === <br />
<br />
<math>c_{ij}</math> represents the coupling coefficient between capsule j and capsule i. It is calculated by applying the softmax function on the log prior probability <math>b_{ij}</math>. The mathematical transformation is shown below (Equation 3 in paper): <br />
<br />
\begin{align}<br />
c_{ij} = \frac{exp(b_ij)}{\sum_{k}exp(b_ik)}<br />
\end{align}<br />
<br />
<math>c_{ij}</math> are served as weights for computing the weighted sum and probabilities. Therefore, as probabilities, they have the following properties:<br />
<br />
\begin{align}<br />
c_{ij} \geq 0, \forall i, j<br />
\end{align}<br />
<br />
and, <br />
<br />
\begin{align}<br />
\sum_{i,j}c_{ij} = 1, \forall i, j<br />
\end{align}<br />
<br />
=== Predicted Output from Layer Below === <br />
<br />
<math>u_{i}</math> are the output vector from capsule i in the lower layer, and <math>\hat{u}_{j|i}</math> are the input vector for capsule j, which are the "prediction vectors" from the capsules in the layer below. <math>\hat{u}_{j|i}</math> is produced by multiplying <math>u_{i}</math> by a weight matrix <math>W_{ij}</math>, such as the following:<br />
<br />
\begin{align}<br />
\hat{u}_{j|i} = W_{ij}u_i<br />
\end{align}<br />
<br />
where <math>W_{ij}</math> encodes some spatial relationship between capsule j and capsule i.<br />
<br />
=== Capsule ===<br />
<br />
By using the definitions from previous sections, the total input vector for an arbitrary capsule j can be defined as:<br />
<br />
\begin{align}<br />
s_j = \sum_{i}c_{ij}\hat{u}_{j|i}<br />
\end{align}<br />
<br />
which is a weighted sum over all prediction vectors by using coupling coefficients.<br />
<br />
=== Squashing ===<br />
<br />
The length of <math>s_j</math> is arbitrary, which is needed to be addressed with. The next step is to convert its length between 0 and 1, since we want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The "squashing" process is shown below:<br />
<br />
\begin{align}<br />
v_j = \frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}<br />
\end{align}<br />
<br />
Notice that "squashing" is not just normalizing the vector into unit length. In addition, it does extra non-linear transformation to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. The reason for doing this is to make decision of routing, which is called "routing by agreement" much easier to make between capsule layers.<br />
<br />
=== Agreement ===<br />
<br />
The final step of a routing iteration is to form an routing agreement <math>a_{ij}</math>, which is represents as a scalar product:<br />
<br />
\begin{align}<br />
a_{ij} = v_{j} \cdot \hat{u}_{j|i}<br />
\end{align}<br />
<br />
As we mentioned in "squashing" section, the length of <math>v_{j}</math> is either close to 0 or close to 1, which will effect the magnitude of <math>a_{ij}</math> in this case. Therefore, the magnitude of <math>a_{ij}</math> indicate the how strong the routing algorithm agrees on taking the route between capsule j and capsule i. For each routing iteration, the log prior probability, <math>b_{ij}</math> will be updated by adding the value of its agreement value, which will effect how the coupling coefficients are computed in the next routing iteration. Because of the "squashing" process, we will eventually end up with a capsule j with its <math>v_{j}</math> close to 1 while all other capsules with its <math>v_{j}</math> close to 0, which indicates that this capsule j should be activated.<br />
<br />
= CapsNet Architecture =<br />
<br />
The second part of this paper discuss the experiment results from a 3-layer CapsNet, the architecture can be divided into two parts, encoder and decoder. <br />
<br />
== Encoder == <br />
<br />
[[File:DRBC_Architecture.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== How many routing iteration to use? === <br />
In appendix A of this paper, the authors have shown the empirical results from 500 epochs of training at different choice of routing iterations. According to their observation, more routing iterations increases the capacity of CapsNet but tends to bring additional risk of overfitting. Moreover, CapsNet with routing iterations less than three are not effective in general. As result, they suggest 3 iterations of routing for all experiments.<br />
<br />
=== Marginal loss for digit existence ===<br />
<br />
The experiments performed include segmenting overlapping digits on MultiMINST data set, so the loss function has be adjusted for presents of multiple digits. The marginal lose <math>L_k</math> for each capsule k is calculate by:<br />
<br />
\begin{align}<br />
L_k = T_k max(0, m^+ - ||v_k||)^2 + \lambda(1 - T_k) max(0, ||v_k|| - m^-)^2<br />
\end{align}<br />
<br />
where <math>m^+ = 0.9</math>, <math>m^- = 0.1</math>, and <math>\lambda = 0.5</math>.<br />
<br />
<math>T_k</math> is an indicator for presence of digit of class k, it takes value of 1 if and only if class k is presented. If class k is not presented, <math>\lambda</math> down-weight the loss which shrinks the lengths of the activity vectors for all the digit capsules. By doing this, The loss function penalizes the initial learning for all absent digit class, since we would like the top-level capsule for digit class k to have long instantiation vector if and only if that digit class is present in the input.<br />
<br />
=== Layer 1: Conv1 === <br />
<br />
The first layer of CapsNet. Similar to CNN, this is just convolutional layer that converts pixel intensities to activities of local feature detectors. <br />
<br />
* Layer Type: Convolutional Layer.<br />
* Input: <math>28 \times 28</math> pixels.<br />
* Kernel size: <math>9 \times 9</math>.<br />
* Number of Kernels: 256.<br />
* Activation function: ReLU.<br />
* Output: <math>20 \times 20 \times 256</math> tensor.<br />
<br />
=== Layer 2: PrimaryCapsules ===<br />
<br />
The second layer is formed by 32 primary 8D capsules. By 8D, it means that each primary capsule contains 8 convolutional units with a <math>9 \times 9</math> kernel and a stride of 2. Each capsule will take a <math>20 \times 20 \times 256</math> tensor from Conv1 and produce an output of a <math>6 \times 6 \times 8</math> tensor.<br />
<br />
* Layer Type: Convolutional Layer<br />
* Input: <math>20 \times 20 \times 256</math> tensor.<br />
* Number of capsules: 32.<br />
* Number of convolutional units in each capsule: 8.<br />
* Size of each convolutional unit: <math>6 \times 6</math>.<br />
* Output: <math>6 \times 6 \times 8</math> 8-dimensional vectors.<br />
<br />
=== Layer 3: DigitsCaps ===<br />
<br />
The last layer has 10 16D capsules, one for each digit. Not like the PrimaryCapsules layer, this layer is fully connected. Since this is the top capsule layer, dynamic routing mechanism will be applied between DigitsCaps and PrimaryCapsules. The process begins by taking a transformation of predicted output from PrimaryCapsules layer. Each output is a 8-dimensional vector, which needed to be mapped to a 16-dimensional space. Therefore, the weight matrix, <math>W_{ij}</math> is a <math>8 \times 16</math> matrix. The next step is to acquire coupling coefficients from routing algorithm and to perform "squashing" to get the output. <br />
<br />
* Layer Type: Fully connected layer.<br />
* Input: <math>6 \times 6 \times 8</math> 8-dimensional vectors.<br />
* Output: <math>16 \times 10 </math> matrix.<br />
<br />
=== The loss function ===<br />
<br />
The output of the loss function would be a ten-dimensional one-hot encoded vector with 9 zeros and 1 one at the correct position.<br />
<br />
<br />
== Regularization Method: Reconstruction ==<br />
<br />
This is regularization method introduced in the implementation of CapsNet. The method is to introduce a reconstruction loss (scaled down by 0.0005) to margin loss during training. The authors argue this would encourage the digit capsules to encode the instantiation parameters the input digits. All the reconstruction during training is by using the true labels of the image input. The results from experiments also confirms that adding the reconstruction regularizer enforces the pose encoding in CapsNet and thus boots the performance of routing procedure. <br />
<br />
=== Decoder ===<br />
<br />
The decoder consists of 3 fully connected layers, each layer maps pixel intensities to pixel intensities. The number of parameters in each layer and the activation functions used are indicated in the figure below:<br />
<br />
[[File:DRBC_Decoder.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== Result ===<br />
<br />
The authors include some results for CapsNet classification test accuracy to justify the result of reconstruction. We can see that for CapsNet with 1 routing iteration and CapsNet with 3 routing iterations, implement reconstruction shows significant improvements in both MINIST and MultiMINST data set. These improvements show the importance of routing and reconstruction regularizer. <br />
<br />
[[File:DRBC_Reconstruction.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
The decision to use a 3 iteration approach came from experimental results. The image below shows the average logit difference over epochs and at the end for different numbers of routing iterations.<br />
<br />
[[File:DRBC_AvgLogitDiff.png|700px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
The above image shows that the average logit difference decreases at a logarithmic rate according to the number of iterations. As part of this, it was seen that the higher routing iterations lead to overfitting on the training dataset. The following image however, shows that when trained on CIFAR10 the training loss is much lower for the 3 iteration method over the 1 iteration method. From these two evaluations the 3 iteration approach was selected as the most ideal.<br />
<br />
[[File:DRBC_TrainLossIter.png|350px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
= Experiment Results for CapsNet = <br />
<br />
In this part, the authors demonstrate experiment results of CapsNet on different data sets, such as MINIST and different variation of MINST, such as expanded MINST, affNIST, MultiMNIST. Moreover, they also briefly discuss the performance on some other popular data set such CIFAR 10. <br />
<br />
== MINST ==<br />
<br />
=== Highlights ===<br />
<br />
* CapsNet archives state-of-the-art performance on MINST with significantly fewer parameters (3-layer baseline CNN model has 35.4M parameters, compared to 8.2M for CapsNet with reconstruction network).<br />
* CapsNet with shallow structure (3 layers) achieves performance that only achieves by deeper network before.<br />
<br />
=== Interpretation of Each Capsule ===<br />
<br />
The authors suggest that they found evidence that dimension of some capsule always captures some variance of the digit, while some others represents the global combinations of different variations, this would open some possibility for interpretation of capsules in the future. After computing the activity vector for the correct digit capsule, the authors fed perturbed versions of those activity vectors to the decoder to examine the effect on reconstruction. Some results from perturbations are shown below, where each row represents the reconstructions when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 from the range [-0.25, 0.25]: <br />
<br />
[[File:DRBC_Dimension.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
== affNIST == <br />
<br />
affNIT data set contains different affine transformation of original MINST data set. By the concept of capsule, CapsNet should gain more robustness from its equivariance nature, and the result confirms this. Compare the baseline CNN, CapsNet achieves 13% improvement on accuracy.<br />
<br />
== MultiMNIST ==<br />
<br />
The MultiMNIST is basically the overlapped version of MINIST. An important point to notice here is that this data set is generated by overlaying a digit on top of another digit from the same set but different class. In other words, the case of stacking digits from the same class is not allowed in MultiMINST. For example, stacking a 5 on a 0 is allowed, but stacking a 5 on another 5 is not. The reason is that CapsNet suffers from the "crowding" effect which will be discussed in the weakness of CapsNet section.<br />
<br />
The architecture used for the training is same as the one used for MNIST dataset. However, decay step of the learning rate is 10x larger to account for the larger dataset. Even with the overlap in MultiMNIST, the network is able to segment both digits separately and it shows that the network is able to position and style of the object in the image.<br />
<br />
[[File:multimnist.PNG | 700px|thumb|center|This figure shows some sample reconstructions on the MultiMNIST dataset using CapsNet. CapsNet reconstructs both of the digits in the image in different colours (green and red). It can be seen that the right most images have incorrect classifications with the 9 being classified as a 0 and the 7 being classified as an 8. ]]<br />
<br />
== Other data sets ==<br />
<br />
CapsNet is used on other data sets such as CIFAR10, smallNORB and SVHN. The results are not comparable with state-of-the-art performance, but it is still promising since this architecture is the very first, while other networks have been development for a long time. The authors pointed out one drawback of CapsNet is that they tend to account for everything in the input images - in the CIFAR10 dataset, the image backgrounds were too varied to model in a reasonably sized network, which partly explains the poorer results.<br />
<br />
= Conclusion = <br />
<br />
This paper discuss the specific part of capsule network, which is the routing-by-agreement mechanism. <br />
<br />
The authors suggest this is a great approach to solve the current problem with max-pooling in convolutional neural network. We see that the design of the capsule builds up upon the design of artificial neuron, but expands it to the vector form to allow for more powerful representational capabilities. It also introduces matrix weights to encode important hierarchical relationships between features of different layers. The result succeeds to achieve the goal of the designer: neuronal activity equivariance with respect to changes in inputs and invariance in probabilities of feature detection. <br />
<br />
Moreover, as author mentioned, the approach mentioned in this paper is only one possible implementation of the capsule concept. Approaches like [https://openreview.net/pdf?id=HJWLfGWRb/ this] have also been proposed to test other routing techniques.<br />
<br />
The preliminary results from experiment using a simple shallow CapsNet also demonstrate unparalleled performance that indicates the capsules are a direction worth exploring.<br />
<br />
= Weakness of Capsule Network =<br />
<br />
* Routing algorithm introduces internal loops for each capsule. As number of capsules and layers increases, these internal loops may exponentially expand the training time. <br />
* Capsule network suffers a perceptual phenomenon called "crowding", which is common for human vision as well. To address this weakness, capsules have to make a very strong representation assumption that at each location of the image, there is at most one instance of the type of entity that capsule represents. This is also the reason for not allowing overlaying digits from same class in generating process of MultiMINST.<br />
* Other criticisms include that the design of capsule networks requires domain knowledge or feature engineering, contrary to the abstraction-oriented goals of deep learning.<br />
<br />
= Implementations = <br />
1) Tensorflow Implementation : https://github.com/naturomics/CapsNet-Tensorflow<br />
<br />
2) Keras Implementation. : https://github.com/XifengGuo/CapsNet-Keras<br />
<br />
= References =<br />
# S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017<br />
# “XifengGuo/CapsNet-Keras.” GitHub, 14 Dec. 2017, github.com/XifengGuo/CapsNet-Keras. <br />
# “Naturomics/CapsNet-Tensorflow.” GitHub, 6 Mar. 2018, github.com/naturomics/CapsNet-Tensorflow.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DRBC_TrainLossIter.png&diff=36211File:DRBC TrainLossIter.png2018-04-06T21:48:25Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DRBC_AvgLogitDiff.png&diff=36210File:DRBC AvgLogitDiff.png2018-04-06T21:48:15Z<p>Swalsh: </p>
<hr />
<div></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=36208MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-04-06T21:16:44Z<p>Swalsh: /* Surface Normals */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforce re-projection consistency between the 3D shape and the estimated sketch. 2.5D is the construction of a 3D environment using 2D retina projection along with depth perception obtained from the image. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth maps (contains information related to the distance of surfaces from a viewpoint) and surface normal maps (technique for adding the illusion of depth details to surfaces using an image's RGB information), the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
=== Notes: 2.5D === <br />
<br />
Two and a half dimensional (shortened to 2.5D, known alternatively as three-quarter perspective and pseudo-3D) is a term used to describe either 2D graphical projections and similar techniques used to cause images to simulate the appearance of being three-dimensional (3D) when in fact they are not, or gameplay in an otherwise three-dimensional video game that is restricted to a two-dimensional plane or has a virtual camera with fixed angle.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. A voxel is an abbreviation for volume element, the three-dimensional version of a pixel. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, and has been widely used in 3D shape completion with the use of depths and silhouettes. A few recent papers [5,6,7,8] discussed enforcing differentiable 2D-3D constraints between shape and silhouettes to enable joint training of deep networks for the task of 3D reconstruction. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in the figure below. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the 2.5 sketch with surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information such as texture and lighting. An encoder-decoder architecture is used. The encoder is a A ResNet-18 network, which takes a 256 x 256 RGB image and produces 512 feature maps of size 8 x 8. The decoder is four sets of 5 x 5 fully convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem since it only takes in surface normal and depth images as input. The network architecture is inspired by the TL[10] network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 fully convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel with depth <math>v_{x, y}, d_{x, y}</math> should be 1, while all voxels in front of it should be 0. This ensures the estimated 3D shape matches the estimated depth values. The projected depth loss and its gradient are defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
When <math>d_{x, y} = \infty</math>, all voxels in front of it should be 0 when there is no intersection between the line and its shape, referred as the silhouette criterion.<br />
<br />
=== Surface Normals ===<br />
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal tried to guarantee voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y-1, z+\frac{n_b}{n_c}}} = 2(v_{x, y-1, z+\frac{n_b}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y+1, z-\frac{n_b}{n_c}}} = 2(v_{x, y+1, z-\frac{n_b}{n_c}}-1)<br />
</math><br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets; ShapeNet, PASCAL 3D+, and IKEA. Intersection-over-Union (IoU) is the main measurement of comparison between the models. However the authors note that models which focus on the IoU metric fail to capture the details of the object they are trying to model, disregarding details to focus on the overall shape. To counter this drawback they poll people on which reconstruction is preferred. IoU is also computationally inefficient since it has to check over all possible scales.<br />
<br />
== ShapeNet ==<br />
The data is based on synthesized images of ShapeNet chairs [Chang et al., 2015]. From the SUN database [Xiao et al., 2010], they combine the chars with random backgrounds and use a physics-based renderer by Jakob to render the corresponding RGB, depth, surface normal, and silhouette images.<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random background from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. The estimated normal and depth images are able to extract intrinsic information about object shape while leaving behind non-essential information such as textures from the original images. Quantitatively, the full model also achieves 0.57 integer over union score (which compares the overlap of the predicted model and ground truth), which is higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. Since PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.<br />
<br />
[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans prefered the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studes show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder. Therefore, future work should address more "difficult" shapes and forms; it should be more difficult to generalize shapes that are more complex than furniture.<br />
<br />
Also there is ambiguity in terms of how the aforementioned self-supervision can work as the authors claim that the model can be fine-tuned using a single image itself. If the parameters are constrained to a single image, then it means it will not generalize well. It is not clearly explained as to what can be fine-tuned.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper. The repository provides the source code as written by the authors: https://github.com/jiajunwu/marrnet<br />
<br />
= References =<br />
# Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T. Freeman, Joshua B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches, 2017<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# JiajunWu, Chengkai Zhang, Tianfan Xue,William T Freeman, and Joshua B Tenenbaum. Learning a Proba- bilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.<br />
# Wu, J. (n.d.). Jiajunwu/marrnet. Retrieved March 25, 2018, from https://github.com/jiajunwu/marrnet<br />
# Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016a.<br />
# Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.<br />
# Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Rohit Girdhar, David F. Fouhey, Mikel Rodriguez and Abhinav Gupta, Learning a Predictable and Generative Vector Representation for Objects, in ECCV 2016<br />
#Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015. <br />
#Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. <br />
#Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=36207MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-04-06T21:16:25Z<p>Swalsh: added y gradient calculation</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforce re-projection consistency between the 3D shape and the estimated sketch. 2.5D is the construction of a 3D environment using 2D retina projection along with depth perception obtained from the image. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth maps (contains information related to the distance of surfaces from a viewpoint) and surface normal maps (technique for adding the illusion of depth details to surfaces using an image's RGB information), the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
=== Notes: 2.5D === <br />
<br />
Two and a half dimensional (shortened to 2.5D, known alternatively as three-quarter perspective and pseudo-3D) is a term used to describe either 2D graphical projections and similar techniques used to cause images to simulate the appearance of being three-dimensional (3D) when in fact they are not, or gameplay in an otherwise three-dimensional video game that is restricted to a two-dimensional plane or has a virtual camera with fixed angle.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. A voxel is an abbreviation for volume element, the three-dimensional version of a pixel. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, and has been widely used in 3D shape completion with the use of depths and silhouettes. A few recent papers [5,6,7,8] discussed enforcing differentiable 2D-3D constraints between shape and silhouettes to enable joint training of deep networks for the task of 3D reconstruction. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in the figure below. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the 2.5 sketch with surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information such as texture and lighting. An encoder-decoder architecture is used. The encoder is a A ResNet-18 network, which takes a 256 x 256 RGB image and produces 512 feature maps of size 8 x 8. The decoder is four sets of 5 x 5 fully convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem since it only takes in surface normal and depth images as input. The network architecture is inspired by the TL[10] network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 fully convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel with depth <math>v_{x, y}, d_{x, y}</math> should be 1, while all voxels in front of it should be 0. This ensures the estimated 3D shape matches the estimated depth values. The projected depth loss and its gradient are defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
When <math>d_{x, y} = \infty</math>, all voxels in front of it should be 0 when there is no intersection between the line and its shape, referred as the silhouette criterion.<br />
<br />
=== Surface Normals ===<br />
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal tried to guarantee voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
<br />
Gradients along y are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y-1, z+\frac{n_b}{n_c}}} = 2(v_{x, y-1, z+\frac{n_b}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y+1, z-\frac{n_b}{n_c}}} = 2(v_{x, y+1, z-\frac{n_b}{n_c}}-1)<br />
</math><br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets; ShapeNet, PASCAL 3D+, and IKEA. Intersection-over-Union (IoU) is the main measurement of comparison between the models. However the authors note that models which focus on the IoU metric fail to capture the details of the object they are trying to model, disregarding details to focus on the overall shape. To counter this drawback they poll people on which reconstruction is preferred. IoU is also computationally inefficient since it has to check over all possible scales.<br />
<br />
== ShapeNet ==<br />
The data is based on synthesized images of ShapeNet chairs [Chang et al., 2015]. From the SUN database [Xiao et al., 2010], they combine the chars with random backgrounds and use a physics-based renderer by Jakob to render the corresponding RGB, depth, surface normal, and silhouette images.<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random background from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. The estimated normal and depth images are able to extract intrinsic information about object shape while leaving behind non-essential information such as textures from the original images. Quantitatively, the full model also achieves 0.57 integer over union score (which compares the overlap of the predicted model and ground truth), which is higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. Since PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.<br />
<br />
[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans prefered the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studes show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder. Therefore, future work should address more "difficult" shapes and forms; it should be more difficult to generalize shapes that are more complex than furniture.<br />
<br />
Also there is ambiguity in terms of how the aforementioned self-supervision can work as the authors claim that the model can be fine-tuned using a single image itself. If the parameters are constrained to a single image, then it means it will not generalize well. It is not clearly explained as to what can be fine-tuned.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper. The repository provides the source code as written by the authors: https://github.com/jiajunwu/marrnet<br />
<br />
= References =<br />
# Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T. Freeman, Joshua B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches, 2017<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# JiajunWu, Chengkai Zhang, Tianfan Xue,William T Freeman, and Joshua B Tenenbaum. Learning a Proba- bilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.<br />
# Wu, J. (n.d.). Jiajunwu/marrnet. Retrieved March 25, 2018, from https://github.com/jiajunwu/marrnet<br />
# Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016a.<br />
# Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.<br />
# Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Rohit Girdhar, David F. Fouhey, Mikel Rodriguez and Abhinav Gupta, Learning a Predictable and Generative Vector Representation for Objects, in ECCV 2016<br />
#Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015. <br />
#Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. <br />
#Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=36206stat946w18/Tensorized LSTMs2018-04-06T21:05:27Z<p>Swalsh: </p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening an LSTM or adding layers to it usually introduces additional parameters, which in turn increases the time required for model training and evaluation. As an alternative, this paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the '''Tensorized LSTM''' in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNNs learn how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from the initial time-step to the final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state h_t is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)\times M} </math> guarantees each hidden state provided by the previous step is of dimension M. <math> a_t ∈R^M </math> is the hidden activation, and φ(·) is the element-wise hyperbolic tangent. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that the <math>\phi</math> is a non-linear, element-wise function which generates hidden output, while <math>\varphi</math> generates the final network output.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
One shortfall of RNN is the problem of vanishing/exploding gradients. This shortfall is significant, especially when modeling long-range dependencies. One alternative is to instead use LSTM (Long Short-Term Memory), which alleviates these problems by employing several gates to selectively modulate the information flow across each neuron. Since LSTMs have been successfully used in sequence models, it is natural to consider them for accommodating more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. This convolution is defined by:<br />
<br />
\begin{align}<br />
C_{t-1,p,m}^{conv} = \sum\limits_{k=1}^K C_{t-1,p-\frac{K-1}{2}+k,m} · W_{t,k,1,1}^c(p) \hspace{2cm} (30)<br />
\end{align}<br />
<br />
where <math>C_{t-1}</math> is padded with the boundary values to retain the stored information.<br />
<br />
Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
Theorem 17-18 of Leifert et al. [3] proves the prevention of vanishing/exploding gradients for the lambda gate, which is very similar to the proposed memory cell convolution kernel. The only major differences between the two are the use of softmax for normalization and the sharing the of the kernal for all channels. Since these changes to not affect the assertions made in Theorem 17-18, it can be established that the prevention of vanishing/exploding gradients is a feature of the memory cell convolution kernel as well.<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the original tensor, normalized tensor, gain parameter and bias parameter. The <math>m^z</math>-th channel of <math>\mathbf{Z}</math> is normalized element-wisely: <br />
<br />
\begin{align}<br />
\hat{z_{m^z}} = (z_{m^z} - z^\mu)/z^{\sigma} \hspace{2cm} (21)<br />
\end{align}<br />
<br />
where <math>z^{\mu}</math>, <math>z^{\sigma} \in \mathbb{R}^P</math> are the mean and standard deviation along the channel dimension of <math>\mathbf{Z}</math>, and <math>\hat{z_{m^z}} \in \mathbb{R}^P</math> is the <math>m^z</math>-th channel <math>\mathbf{\hat{Z}}</math>. Channel normalization introduces very few additional parameters compared to the number of other parameters in the model.<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of list of models tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundaments:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. Can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.<br />
<br />
All configurations are evaluated with depths L = 1, 2, 3, 4. Bits-per-character(BPC) is used to measure the model performance and the results are shown in the figure below.<br />
[[File:wiki.png |280px|center||Figure 5: WifiPerf]]<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. Two tasks are used for evaluation on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
In both tasks, all configurations are evaluated with M = 100 and L= 1, 3, 5. The model performance is measured by the classification accuracy and results are shown in the figure below.<br />
<br />
[[File:MNISTperf.png |480px|center]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
It can be seen in the above figure that tLSTM behaves differently with different tasks:<br />
<br />
- Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa<br />
<br />
- addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum <br />
<br />
- memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep<br />
<br />
- sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction <br />
<br />
- sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.<br />
<br />
From the figure above it can can also be observe some common phenomena in all tasks: <br />
# it is clear that wider (larger) tensors can encode more information by observing that at each timestep, the values at different tensor locations are markedly different<br />
# from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.<br />
<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. Then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches. The paper shows a method to widen and deepen the LSTM network at the same time and the following 3 points list their main contributions:<br />
* The RNNs are now tensorized into higher dimensional tensors which are more flexible.<br />
* RNNs' deep computation is merged into the temporal computation, referred to as the tensorizedRNN.<br />
* tRNN is extended to a LSTM architecture and a new architecture is studied: tensorizedLSTM<br />
<br />
= Critique(to be edited) =<br />
<br />
* Using tensor as hidden layer indeed increasing the capability of the network, but authors never mentioned the trade-off in terms of extra computation cost and training time.<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018><br />
#Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=36205stat946w18/Tensorized LSTMs2018-04-06T20:24:30Z<p>Swalsh: /* Method */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening an LSTM or adding layers to it usually introduces additional parameters, which in turn increases the time required for model training and evaluation. As an alternative, this paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the '''Tensorized LSTM''' in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNNs learn how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from the initial time-step to the final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state h_t is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)\times M} </math> guarantees each hidden state provided by the previous step is of dimension M. <math> a_t ∈R^M </math> is the hidden activation, and φ(·) is the element-wise hyperbolic tangent. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that the <math>\phi</math> is a non-linear, element-wise function which generates hidden output, while <math>\varphi</math> generates the final network output.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
One shortfall of RNN is the problem of vanishing/exploding gradients. This shortfall is significant, especially when modeling long-range dependencies. One alternative is to instead use LSTM (Long Short-Term Memory), which alleviates these problems by employing several gates to selectively modulate the information flow across each neuron. Since LSTMs have been successfully used in sequence models, it is natural to consider them for accommodating more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. This convolution is defined by:<br />
<br />
\begin{align}<br />
C_{t-1,p,m}^{conv} = \sum\limits_{k=1}^K C_{t-1,p-\frac{K-1}{2}+k,m} · W_{t,k,1,1}^c(p) \hspace{2cm} (30)<br />
\end{align}<br />
<br />
where <math>C_{t-1}</math> is padded with the boundary values to retain the stored information.<br />
<br />
Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the original tensor, normalized tensor, gain parameter and bias parameter. The <math>m^z</math>-th channel of <math>\mathbf{Z}</math> is normalized element-wisely: <br />
<br />
\begin{align}<br />
\hat{z_{m^z}} = (z_{m^z} - z^\mu)/z^{\sigma} \hspace{2cm} (21)<br />
\end{align}<br />
<br />
where <math>z^{\mu}</math>, <math>z^{\sigma} \in \mathbb{R}^P</math> are the mean and standard deviation along the channel dimension of <math>\mathbf{Z}</math>, and <math>\hat{z_{m^z}} \in \mathbb{R}^P</math> is the <math>m^z</math>-th channel <math>\mathbf{\hat{Z}}</math>. Channel normalization introduces very few additional parameters compared to the number of other parameters in the model.<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of list of models tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundaments:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. Can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.<br />
<br />
All configurations are evaluated with depths L = 1, 2, 3, 4. Bits-per-character(BPC) is used to measure the model performance and the results are shown in the figure below.<br />
[[File:wiki.png |280px|center||Figure 5: WifiPerf]]<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. Two tasks are used for evaluation on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
In both tasks, all configurations are evaluated with M = 100 and L= 1, 3, 5. The model performance is measured by the classification accuracy and results are shown in the figure below.<br />
<br />
[[File:MNISTperf.png |480px|center]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
It can be seen in the above figure that tLSTM behaves differently with different tasks:<br />
<br />
- Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa<br />
<br />
- addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum <br />
<br />
- memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep<br />
<br />
- sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction <br />
<br />
- sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.<br />
<br />
From the figure above it can can also be observe some common phenomena in all tasks: <br />
# it is clear that wider (larger) tensors can encode more information by observing that at each timestep, the values at different tensor locations are markedly different<br />
# from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.<br />
<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. Then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches. The paper shows a method to widen and deepen the LSTM network at the same time and the following 3 points list their main contributions:<br />
* The RNNs are now tensorized into higher dimensional tensors which are more flexible.<br />
* RNNs' deep computation is merged into the temporal computation, referred to as the tensorizedRNN.<br />
* tRNN is extended to a LSTM architecture and a new architecture is studied: tensorizedLSTM<br />
<br />
= Critique(to be edited) =<br />
<br />
* Using tensor as hidden layer indeed increasing the capability of the network, but authors never mentioned the trade-off in terms of extra computation cost and training time.<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=36204Training And Inference with Integers in Deep Neural Networks2018-04-06T20:13:06Z<p>Swalsh: Minor editorial added reference links for quick ref jumps</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated and have high memory requirements while performing many floating-point operations (FLOPs). As a result, running many of these models will be very expensive in terms of energy use, and using state-of-the-art networks in applications where energy is limited can be very difficult. In order to overcome this and allow use of these networks in situations with low energy availability, the energy costs must be reduced while trying to maintain as high network performance as possible and/or practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. The authors address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated. A TPU is a Tensor Processing Unit developed by Google for Tensor operations. TPU is comparative to a GPU but produces higher IO per second for low precision computations.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V<sup>[[#References|[1]]]</sup><br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations <sup>[[#References|[2]]]</sup> add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. XNOR-Net <sup>[[#References|[11]]]</sup> uses bitwise operations to approximate convolutions in a highly memory-efficient manner, and applies a filter-wise scaling factor for weights to improve performance. However, these floating-point factors are calculated simultaneously during training, which aggravates the training effort. Ternary weight networks (TWN) <sup>[[#References|[3]]]</sup> and Trained ternary quantization (TTQ)<sup>[[#References|[9]]]</sup> offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds. Tang et al.<sup>[[#References|[14]]]</sup> achieve impressive results by using a binarization scheme according to which floating-point activation vectors are approximated as linear combinations of binary vectors, where the weights in the linear combination are floating-point. Still other approaches rely on relative quantization<sup>[[#References|[13]]]</sup>; however, an efficient implementation is difficult to apply in practice due to the requirements of persisting and applying a codebook.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
The DoReFa-Net quantizes gradients to low-bandwidth floating point numbers with discrete states in the backwards pass. In order to reduce the overhead of gradient synchronization in distributed training the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math>, <br />
<br />
where <math>round</math> approximates continuous values to their nearest discrete state, and <math>Clip</math> is the saturation function that clips unbounded values to <math>[-1 + \sigma, 1 - \sigma]</math>. Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA<sup>[[#References|[4]]]</sup>.<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original initialization method for <math>\eta</math> is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size seen already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net<sup>[[#References|[5]]]</sup>) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant<sup>[[#References|[6]]]</sup> with 32C5-MP2-64C5-MP2-512FC-10SSE.<br />
<br />
SVHN & CIFAR10: VGG variant<sup>[[#References|[7]]]</sup> with 2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-10SSE. For CIFAR10 dataset, the data augmentation is followed in Lee et al. (2015)<sup>[[#References|[10]]]</sup> for training.<br />
<br />
ImageNet: AlexNet variant<sup>[[#References|[8]]]</sup> on ILSVRC12 dataset.<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for Vanilla network and Wage network. After being quantized and shifted each layer, the error is reshaped and so most orientation information is retained. ]]<br />
<br />
The table below shows the test error rates on CIFAR10 when left-shift upper boundary with factor γ. From this table we could see that large values play critical roles for backpropagation training even though they are infrequent while the majority with small values are just noise.<br />
<br />
[[File:testerror_rate.png|center]]<br />
<br />
=== Bitwidth of Gradients ===<br />
<br />
The authors next investigated the choice of a proper <math>k_G</math> for gradients using the CIFAR10 dataset. <br />
<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
<br />
The results show similar bitwidth requirements as the last experiment for <math>k_E</math>.<br />
<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added. 7 models are used: 2888 from the first experiment, 288C for more accurate errors (12 bits), 28C8 for larger buffer space, 28f8 for non-quantization of gradients, 28ff for errors and gradients in float32, and 28ff with BN added. The baseline vanilla model refers to the original AlexNet architecture. <br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
<br />
The comparison between 28C8 and 288C shows that the model may perform better if it has more buffer space <math>k_G</math> for gradient accumulation than if it has high-resolution orientation <math>k_E</math>. The authors also noted that batch normalization and <math>k_G</math> are more important for ImageNet because the training set samples are highly variant.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. By quantizing all operations and operands of a network, the authors successfully reduce the energy costs of both training and inference with deep learning architectures. Future work may further improve compression and memory requirements.<br />
<br />
== Implementation ==<br />
The following repository provides the source code for the paper: https://github.com/boluoweifenda/WAGE. The repository provides the source code as written by the authors, in Tensorflow.<br />
<br />
== Limitation == <br />
<br />
* The paper states the advantages in energy costs, but is there any limitation or trade-off by selecting integer instead of float-point-operation? What is a good situation for such implementation? The authors should explain more on this.<br />
<br />
== References ==<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.<br />
# Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeplysupervisednets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.<br />
# Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.<br />
# “Boluoweifenda/WAGE.” GitHub, github.com/boluoweifenda/WAGE.<br />
# Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.<br />
# Tang, Wei, Gang Hua, and Liang Wang. "How to train a compact binary neural network with high accuracy?." AAAI. 2017.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35390Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T15:31:49Z<p>Swalsh: /* Architecture */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale. <br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35389Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T15:22:50Z<p>Swalsh: /* Computational Limit Inclusion */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35388Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T15:21:43Z<p>Swalsh: /* Computational Limit Inclusion */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_test|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_test|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35387Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T15:21:12Z<p>Swalsh: /* Computational Limit Inclusion */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|Dtest|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|Dtest|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18&diff=35345stat946w182018-03-23T00:39:00Z<p>Swalsh: </p>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1fU746Cld_mSqQBCD5qadvkXZW1g-j-kHvmHQ6AMeuqU/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
[https://docs.google.com/forms/d/e/1FAIpQLSdcfYZu5cvpsbzf0Nlxh9TFk8k1m5vUgU1vCLHQNmJog4xSHw/viewform?usp=sf_link Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Feb 27 || || 1|| || || <br />
|-<br />
|Feb 27 || || 2|| || || <br />
|-<br />
|Feb 27 || || 3|| || || <br />
|-<br />
|Mar 1 || Peter Forsyth || 4|| Unsupervised Machine Translation Using Monolingual Corpora Only || [https://arxiv.org/pdf/1711.00043.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]<br />
|-<br />
|Mar 1 || wenqing liu || 5|| Spectral Normalization for Generative Adversarial Networks || [https://openreview.net/pdf?id=B1QRgziT- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network Summary]<br />
|-<br />
|Mar 1 || Ilia Sucholutsky || 6|| One-Shot Imitation Learning || [https://papers.nips.cc/paper/6709-one-shot-imitation-learning.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning Summary]<br />
|-<br />
|Mar 6 || George (Shiyang) Wen || 7|| AmbientGAN: Generative models from lossy measurements || [https://openreview.net/pdf?id=Hy7fDog0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements Summary]<br />
|-<br />
|Mar 6 || Raphael Tang || 8|| Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers || [https://arxiv.org/pdf/1802.00124.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers Summary]<br />
|-<br />
|Mar 6 ||Fan Xia || 9|| Word translation without parallel data ||[https://arxiv.org/pdf/1710.04087.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data Summary]<br />
|-<br />
|Mar 8 || Alex (Xian) Wang || 10 || Self-Normalizing Neural Networks || [http://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks Summary] <br />
|-<br />
|Mar 8 || Michael Broughton || 11|| Convergence of Adam and beyond || [https://openreview.net/pdf?id=ryQu7f-RZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond Summary] <br />
|-<br />
|Mar 8 || Wei Tao Chen || 12|| Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/forum?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data Summary]<br />
|-<br />
|Mar 13 || Chunshang Li || 13 || UNDERSTANDING IMAGE MOTION WITH GROUP REPRESENTATIONS || [https://openreview.net/pdf?id=SJLlmG-AZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations Summary] <br />
|-<br />
|Mar 13 || Saifuddin Hitawala || 14 || Robust Imitation of Diverse Behaviors || [https://papers.nips.cc/paper/7116-robust-imitation-of-diverse-behaviors.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors Summary] <br />
|-<br />
|Mar 13 || Taylor Denouden || 15|| A neural representation of sketch drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings Summary]<br />
|-<br />
|Mar 15 || Zehao Xu || 16|| Synthetic and natural noise both break neural machine translation || [https://openreview.net/pdf?id=BJ8vJebC- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation Summary]<br />
|-<br />
|Mar 15 || Prarthana Bhattacharyya || 17|| Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders Summary] <br />
|-<br />
|Mar 15 || Changjian Li || 18|| Label-Free Supervision of Neural Networks with Physics and Domain Knowledge || [https://arxiv.org/pdf/1609.05566.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge Summary]<br />
|-<br />
|Mar 20 || Travis Dunn || 19|| Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments || [https://openreview.net/pdf?id=Sk2u1g-0- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments Summary]<br />
|-<br />
|Mar 20 || Sushrut Bhalla || 20|| MaskRNN: Instance Level Video Object Segmentation || [https://papers.nips.cc/paper/6636-maskrnn-instance-level-video-object-segmentation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation Summary]<br />
|-<br />
|Mar 20 || Hamid Tahir || 21|| Wavelet Pooling for Convolution Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN Summary]<br />
|-<br />
|Mar 22 || Dongyang Yang|| 22|| Implicit Causal Models for Genome-wide Association Studies || [https://openreview.net/pdf?id=SyELrEeAb Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies Summary]<br />
|-<br />
|Mar 22 || Yao Li || 23||Improving GANs Using Optimal Transport || [https://openreview.net/pdf?id=rkQkBnJAb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT Summary]<br />
|-<br />
|Mar 22 || Sahil Pereira || 24||End-to-End Differentiable Adversarial Imitation Learning|| [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning Summary]<br />
|-<br />
|Mar 27 || Jaspreet Singh Sambee || 25|| Do Deep Neural Networks Suffer from Crowding? || [http://papers.nips.cc/paper/7146-do-deep-neural-networks-suffer-from-crowding.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding Summary]<br />
|-<br />
|Mar 27 || Braden Hurl || 26|| Spherical CNNs || [https://openreview.net/pdf?id=Hkbd5xZRb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs Summary]<br />
|-<br />
|Mar 27 || Marko Ilievski || 27|| Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders || [http://proceedings.mlr.press/v70/engel17a/engel17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders Summary]<br />
|-<br />
|Mar 29 || Alex Pon || 28||PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space || [https://arxiv.org/abs/1706.02413 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space Summary]<br />
|-<br />
|Mar 29 || Sean Walsh || 29||Multi-scale Dense Networks for Resource Efficient Image Classification || [https://arxiv.org/pdf/1703.09844.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification#Architecture Summary]<br />
|-<br />
|Mar 29 || Jason Ku || 30||MarrNet: 3D Shape Reconstruction via 2.5D Sketches || [https://arxiv.org/pdf/1711.03129.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches Summary] <br />
|-<br />
|Apr 3 || Tong Yang || 31|| Dynamic Routing Between Capsules. || [http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf Paper] || <br />
|-<br />
|Apr 3 || Benjamin Skikos || 32|| Training and Inference with Integers in Deep Neural Networks || [https://openreview.net/pdf?id=HJGXzmspb Paper] || <br />
|-<br />
|Apr 3 || Weishi Chen || 33|| Tensorized LSTMs for Sequence Learning || [https://arxiv.org/pdf/1711.01577.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&action=edit&redlink=1 Summary] <br />
|-</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35342Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T00:36:57Z<p>Swalsh: /* Architecture */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, Dtest*sum(qk*Ck) <= B must be true. Where Dtest is the total number of test samples, Ck is the computational requirement to get an output from the kth classifier, and qk is the probability that a sample exits at the kth classifier. Assuming that all classifiers have the same base probability, q, then qk can be used to find the threshold<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35339Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T00:34:13Z<p>Swalsh: /* Anytime Prediction */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
(image of arch)<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, Dtest*sum(qk*Ck) <= B must be true. Where Dtest is the total number of test samples, Ck is the computational requirement to get an output from the kth classifier, and qk is the probability that a sample exits at the kth classifier. Assuming that all classifiers have the same base probability, q, then qk can be used to find the threshold<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35338Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T00:33:51Z<p>Swalsh: /* Anytime Prediction */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
(image of arch)<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, Dtest*sum(qk*Ck) <= B must be true. Where Dtest is the total number of test samples, Ck is the computational requirement to get an output from the kth classifier, and qk is the probability that a sample exits at the kth classifier. Assuming that all classifiers have the same base probability, q, then qk can be used to find the threshold<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalshhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35337Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T00:33:29Z<p>Swalsh: /* Budget Batch */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or<br />
large networks that do well on all examples but require a large amount of resources<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
(image of arch)<br />
The architecture of an MSDNet can be thought of as a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throught.<br />
<br />
The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.<br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, Dtest*sum(qk*Ck) <= B must be true. Where Dtest is the total number of test samples, Ck is the computational requirement to get an output from the kth classifier, and qk is the probability that a sample exits at the kth classifier. Assuming that all classifiers have the same base probability, q, then qk can be used to find the threshold<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.<br />
<br />
(image of anytime results)<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.</div>Swalsh