statwiki - User contributions [US]

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-22T01:09:43Z

Cs4li:

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.

To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points were reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:paper28_fig4_chair.png | 300px|thumb|center|An example showing the reduction of points visually. At 256 points, the points making up the object is very spare, however the accuracy is only reduced by 1%]][[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Critique ==

It seems clear that PointNet is lacking capturing local context between points. PointNet++ seems to be an important extension, but the improvements in the experimental results seem small. Some computational efficiency experiments would have been nice. For example, the processing speed of the network, and the computational efficiency of MRG over MRG.

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

File:paper28 fig4 chair.png

2018-03-22T01:04:49Z

Cs4li:

Do Deep Neural Networks Suffer from Crowding

2018-03-21T15:37:02Z

Cs4li:

= Introduction =
Ever since the evolution of Deep Networks, there has been tremendous amount of research and effort that has been put into making machines capable of recognizing objects the same way as humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it and this is a very common real-life experience. This paper focuses on studying the impact of crowding on Deep Neural Networks (DNNs) by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however the left circle contains flankers which are those line segments.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks(DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.

= Models =
== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with minibatches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below:

1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.

2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).

3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

===What is the problem in CNNs?===
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. In this model the input image is cropped into varying scales(11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since we have multiple scales of input image, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1]. These are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments and its Set-Up =
Targets are the set of objects to be recognized and flankers act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis). Examples of the target and flanker configurations is shown below:
[[File:eximages.png|800px|center]]

The target and the object are referred to as ''a'' and ''x'' respectively with the below four conifgurations: (1) No flankers. Only the target object. (a in the plots) (2) One central flanker closer to the center of the image than the target. (xa) (3) One peripheral flanker closer to the boundary of the image that the target. (ax) (4) Two flankers spaced equally around the target, being both the same object (xax).

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. THe tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations

* If the target-flanker spacing is changed, then models perform worse

* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image

* Only the eccentricity-dependent model is robust to different flanker configurations not included in training, when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.
[[File:result2.png|750x400px|center]]
===DCNN Observations===
* The recognition gets worse with the increase in the number of flankers.

* Convolutional networks are capable of being invariant to translations.

* In the constant target eccentricity setup, where target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.

* Spatial pooling helps in learning invariance.

*Flankers similar to the target object helps in recognition since they dont activate the convolutional filter more.

* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.
===Eccentric Model===
The set-up is the same as explained earlier.
[[File:result3.png|750x400px|center]]
====Observations====
* If target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.

* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.

* Early pooling is harmful since it migh take away the useful information very early which might be useful to the network.

==Complex Clutter==
Here the tarets are embedded into images (The places dataset here) and then tests are performed.
[[File:result4.png|750x400px|center]]

====Observations====
- The eccentricity model without contrast normalization only can recognize the target and only when the target is close to the image center.
-
=Conclusions=
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but thats not the case as we trained the model with flankers and it did not give us the ideal results for the target obects.
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. These is because the pooling operation merges nearby responses, such as the target and flankers if they are close.

*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.

*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricitydependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite presence of clutter when the target is at the center of the image.

*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.

=Critique=
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.

=References=
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017

2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.

3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

File:paper25 fig crowding ex.png

2018-03-21T15:23:42Z

Cs4li:

End-to-End Differentiable Adversarial Imitation Learning

2018-03-19T03:45:50Z

Cs4li: Undo revision 34649 by Cs4li (talk)

= Introduction =
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.

To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. The disadvantage of GAIL’s model-free approach is that backpropagation required gradient estimation which tends to suffer from high variance, which results in the need for large sample sizes and variance reduction methods. This paper proposed a model-based method (MGAIL) to address these issues.

= Background =
== Imitation Learning ==
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made my most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy.

This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively overtime. At each time step a new policy is trained on the state distribution induced by the previously trained policies. This is continued till the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This short coming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy.

== Generative Adversarial Networks ==
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:

\begin{align}
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{x\sim p_E}[log(D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]
\end{align}

In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.

GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:

\begin{align}
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{\pi}[log(D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi))
\end{align}

where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.

This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log(D(s,a)] </math>.

The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]
\end{align}

where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:

\begin{align}
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]
\end{align}

REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.

= Algorithm =
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.

== The discriminator network ==
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in (\pi_E, \pi) </math>.

The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this, we can rearrange and factor <math> D(s,a) </math> to obtain:

\begin{aligned}
D(s,a) &= p(\pi|s,a) \\
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\
\end{aligned}

Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:

\begin{aligned}
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}
\end{aligned}

to get the final expression for <math> D(s,a) </math>:
\begin{aligned}
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}
\end{aligned}

<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induces by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action a under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can by obtained from the partial derivatives of <math> D(s,a) </math>:

\begin{aligned}
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\end{aligned}

== Backpropagating through stochastic units ==
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.

=== Continuous Action Distributions ===
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(a,s) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}
\end{align}

=== Categorical Action Distributions ===
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumble-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:

\begin{align}
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]
\end{align}

Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumble-Max trick:

\begin{align}
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{k}exp[\frac{1}{\tau}(g_j + log\ \pi(a_i|s))]}
\end{align}

In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.

The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.

== Backpropagating through a Forward model ==
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 1. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:

\begin{align}
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )
\end{align}

Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagate back in time and influence the actions of policies in earlier times. This is summarized in Figure 2.

[[File:modelFree_blockDiagram.PNG]]

Figure 1: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.

[[File:modelBased_blockDiagram.PNG|1000px]]

Figure 2: Block diagram of model-based adversarial imitation learning. This diagram describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action at is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of a.) the error message <math> \delta_a </math> (Green) that propagates fluently through the differentiable approximation of the sampling process. And b.) the error message <math> \delta_s </math> (Blue) of future time-steps, that propagate back through the differentiable forward model.

== MGAIL Algorithm ==
Shalev- Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:

\begin{align}
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]
\end{align}

Using the results from Heess et al. (2015) this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s’) </math> transitions:

\begin{align}
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]
\end{align}

The policy gradient <math> \nabla_\theta J </math> is calculated by applying equations 12 and 13 recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.

[[File:MGAIL_alg.PNG]]

== Forward Model Structure ==
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled form different distributions need to be first represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A GRU layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better, and more stable results compared to the standard forward model based on a feed forward neural network. The comparison is presented in Figure 3.

[[File:performance_comparison.PNG]]

Figure 3: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).

= Experiments =
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012). Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). Different number of trajectories are used to train the expert for each task, but all trajectories are of length 1000.
The discriminator and generator (policy) networks contains two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them.

[[File:mgail_test_results.PNG]]

Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.

= Discussion =
This paper presented a model-free algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered as one line of future work. Another future work is to address the violation of the fundamental assumption made by all supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution.

= Source =
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

End-to-End Differentiable Adversarial Imitation Learning

2018-03-19T03:26:04Z

Cs4li:

= Introduction =
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.

To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. The disadvantage of GAIL’s model-free approach is that backpropagation required gradient estimation which tends to suffer from high variance, which results in the need for large sample sizes and variance reduction methods. This paper proposed a model-based method (MGAIL) to address these issues.

= Background =
== Imitation Learning ==
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made my most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy.

This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively over time. At each timestep <math>t</math>, a new policy <math>\pi_t</math> is trained on the state distribution induced by the previously trained policies <math>\pi_0, \pi_1, ... \pi_{t-1}</math>. This is continued until the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy <math>\pi_{t-1}</math> and then updates the following:

<center><math>\pi_t = \pi_{t-1} + \alpha (1-\alpha)^{t-1}(\hat{\pi_t} - \pi_0)</math></center>

== Generative Adversarial Networks ==
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:

\begin{align}
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{x\sim p_E}[log(D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]
\end{align}

In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.

GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:

\begin{align}
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{\pi}[log(D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi))
\end{align}

where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.

This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log(D(s,a)] </math>.

The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]
\end{align}

where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:

\begin{align}
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]
\end{align}

REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.

= Algorithm =
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.

== The discriminator network ==
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in (\pi_E, \pi) </math>.

The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this, we can rearrange and factor <math> D(s,a) </math> to obtain:

\begin{aligned}
D(s,a) &= p(\pi|s,a) \\
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\
\end{aligned}

Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:

\begin{aligned}
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}
\end{aligned}

to get the final expression for <math> D(s,a) </math>:
\begin{aligned}
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}
\end{aligned}

<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induces by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action a under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can by obtained from the partial derivatives of <math> D(s,a) </math>:

\begin{aligned}
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\end{aligned}

== Backpropagating through stochastic units ==
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.

=== Continuous Action Distributions ===
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(a,s) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}
\end{align}

=== Categorical Action Distributions ===
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumble-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:

\begin{align}
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]
\end{align}

Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumble-Max trick:

\begin{align}
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{k}exp[\frac{1}{\tau}(g_j + log\ \pi(a_i|s))]}
\end{align}

In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.

The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.

== Backpropagating through a Forward model ==
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 1. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:

\begin{align}
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )
\end{align}

Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagate back in time and influence the actions of policies in earlier times. This is summarized in Figure 2.

[[File:modelFree_blockDiagram.PNG]]

Figure 1: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.

[[File:modelBased_blockDiagram.PNG|1000px]]

Figure 2: Block diagram of model-based adversarial imitation learning. This diagram describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action at is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of a.) the error message <math> \delta_a </math> (Green) that propagates fluently through the differentiable approximation of the sampling process. And b.) the error message <math> \delta_s </math> (Blue) of future time-steps, that propagate back through the differentiable forward model.

== MGAIL Algorithm ==
Shalev- Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:

\begin{align}
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]
\end{align}

Using the results from Heess et al. (2015) this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s’) </math> transitions:

\begin{align}
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]
\end{align}

The policy gradient <math> \nabla_\theta J </math> is calculated by applying equations 12 and 13 recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.

[[File:MGAIL_alg.PNG]]

== Forward Model Structure ==
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled form different distributions need to be first represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A GRU layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better, and more stable results compared to the standard forward model based on a feed forward neural network. The comparison is presented in Figure 3.

[[File:performance_comparison.PNG]]

Figure 3: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).

= Experiments =
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012). Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). Different number of trajectories are used to train the expert for each task, but all trajectories are of length 1000.
The discriminator and generator (policy) networks contains two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them.

[[File:mgail_test_results.PNG]]

Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.

= Discussion =
This paper presented a model-free algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered as one line of future work. Another future work is to address the violation of the fundamental assumption made by all supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution.

= Source =
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

2018-03-18T22:43:14Z

Cs4li:

== Introduction ==
Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (LeCun, Bengio, and Hinton 2015[1]). Nonetheless, humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.

[[File:c433li-1.png|300px|center]]

Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:
* a reduction in the amount of work spent labeling,
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.

== Problem Setup ==
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via

<center><math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math></center>

where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler" functions (Occam's razor) to prevent overfitting the model on the training data.

The focus is on the set of problems/domains where the problem is a complex environment having a complex representation of the output space, for example mapping an input image to the height of an object(since this leads to a complex output space) rather than simple binary classification problem.

In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>.

Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.
<center><math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math></center>

If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.

== Experiments ==
=== Tracking an object in free fall ===
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center>

The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:
<center><math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math></center>

where
<center>
<math>
\mathbf{A} =
\left[ {\begin{array}{*{20}c}
\Delta t & 1 \\
2\Delta t & 1 \\
3\Delta t & 1 \\
\vdots & \vdots \\
N\Delta t & 1 \\
\end{array} } \right]
</math>
</center>

The constraint loss is then defined as
<center><math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math></center>

Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).

[[File:c433li-2.png|650px|center]]

The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.

Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result is pretty satisfying.

==== Evaluation ====
{| class="wikitable"
|-
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper
|-
! scope="row" | Correlation
| 12.1% || 94.5% || 90.1%
|}

=== Tracking the position of a walking man ===
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.

Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.

<center><math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math></center>
which seeks to maximize the standard deviation of the output, and

<center>
<math>\begin{split}
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))
\end{split}
</math>
</center>
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:

<center>
<math>
\begin{split}
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\
& \gamma_1 * h_1(\mathbf{x})
\hphantom{\text{ }}+ \\
& \gamma_2 * h_2(\mathbf{x})
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))
\end{split}
</math>
</center>

[[File:c433li-3.png|650px|center]]

The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.

As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:
==== Evaluation ====
{| class="wikitable"
|-
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper
|-
! scope="row" | Correlation
| 45.9% || 80.5% || 95.4%
|}
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).

=== Detecting objects with causal relationships ===
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.

[[File:paper18_Experiment_3.png|400px|center]]

Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:

<center><math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math></center>

which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects,

<center><math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math></center>

which seeks high variance outputs, and

<center>
<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math>
</center>

which seeks high entropy outputs. The final loss function then becomes:

<center>
<math> \begin{split}
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) +
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)
\end{split}
</math>
</center>

====Evaluation====

The input images, shown in Fig. (4), are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.

== Conclusion and Critique ==
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]).
<center><math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math></center>

Also, the paper has a mistake where they quote the free fall equation as
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math></center>
which should be
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center>
Although in this case it doesn't affect the result.

For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two losses.

In regards to the free fall experiment in particular, the authors apply a fixed acceleration model to create the constraint loss, with the goal of having the network predict height. However, since they did not measure the true height of the object to create test labels, they evaluate using height in pixel space. They do not mention the accuracy of their camera calibration, nor what camera model was used to remove lens distortion. Since lens distortion tends to be worse at the extreme edges of the image, and that they tossed the pillow throughout the entire frame, it is likely that the ground truth labels were corrupted by distortion. If that is the case, it is possible the supervised network is actually performing worse, because it learning how to predict distorted (beyond a constant scaling factor) heights instead of the true height.

These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.

Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.

== References ==
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.

[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.

Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

2018-03-18T22:40:45Z

Cs4li:

== Introduction ==
Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (LeCun, Bengio, and Hinton 2015[1]). Nonetheless, humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.

[[File:c433li-1.png|300px|center]]

Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:
* a reduction in the amount of work spent labeling,
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.

== Problem Setup ==
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via

<center><math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math></center>

where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler" functions (Occam's razor) to prevent overfitting the model on the training data.

The focus is on the set of problems/domains where the problem is a complex environment having a complex representation of the output space, for example mapping an input image to the height of an object(since this leads to a complex output space) rather than simple binary classification problem.

In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>.

Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.
<center><math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math></center>

If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.

== Experiments ==
=== Tracking an object in free fall ===
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center>

The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:
<center><math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math></center>

where
<center>
<math>
\mathbf{A} =
\left[ {\begin{array}{*{20}c}
\Delta t & 1 \\
2\Delta t & 1 \\
3\Delta t & 1 \\
\vdots & \vdots \\
N\Delta t & 1 \\
\end{array} } \right]
</math>
</center>

The constraint loss is then defined as
<center><math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math></center>

Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).

[[File:c433li-2.png|650px|center]]

The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.

Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result is pretty satisfying.
{| class="wikitable"
|+ style="text-align: left;" | Evaluation
|-
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper
|-
! scope="row" | Correlation
| 12.1% || 94.5% || 90.1%
|}

=== Tracking the position of a walking man ===
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.

Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.

<center><math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math></center>
which seeks to maximize the standard deviation of the output, and

<center>
<math>\begin{split}
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))
\end{split}
</math>
</center>
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:

<center>
<math>
\begin{split}
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\
& \gamma_1 * h_1(\mathbf{x})
\hphantom{\text{ }}+ \\
& \gamma_2 * h_2(\mathbf{x})
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))
\end{split}
</math>
</center>

[[File:c433li-3.png|650px|center]]

The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.

As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:
{| class="wikitable"
|+ style="text-align: left;" | Evaluation
|-
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper
|-
! scope="row" | Correlation
| 45.9% || 80.5% || 95.4%
|}
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).

=== Detecting objects with causal relationships ===
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.

[[File:paper18_Experiment_3.png|400px|center]]

Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:

<center><math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math></center>

which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects,

<center><math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math></center>

which seeks high variance outputs, and

<center>
<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math>
</center>

which seeks high entropy outputs. The final loss function then becomes:

<center>
<math> \begin{split}
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) +
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)
\end{split}
</math>
</center>

===Evaluation===

The input images, shown in Fig. (4), are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.

== Conclusion and Critique ==
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]).
<center><math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math></center>

Also, the paper has a mistake where they quote the free fall equation as
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math></center>
which should be
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center>
Although in this case it doesn't affect the result.

For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two losses.

In regards to the free fall experiment in particular, the authors apply a fixed acceleration model to create the constraint loss, with the goal of having the network predict height. However, since they did not measure the true height of the object to create test labels, they evaluate using height in pixel space. They do not mention the accuracy of their camera calibration, nor what camera model was used to remove lens distortion. Since lens distortion tends to be worse at the extreme edges of the image, and that they tossed the pillow throughout the entire frame, it is likely that the ground truth labels were corrupted by distortion. If that is the case, it is possible the supervised network is actually performing worse, because it learning how to predict distorted (beyond a constant scaling factor) heights instead of the true height.

These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.

Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.

== References ==
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.

[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.

On The Convergence Of ADAM And Beyond

2018-03-18T22:32:25Z

Cs4li:

= Introduction =
Somewhat different to the presentation I gave in class, this paper focuses strictly on the pitfalls in convergance of the ADAM training algorithm for neural networks from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that it is possible for ADAM to get "stuck" in it's weighted average history, preventing it from converging to an optimal solution. An example is that in an experiment there may be a large spike in the gradient during some minibatches, but since ADAM weighs the current update by the exponential moving averages of squared past
gradients, the effect of the large spike in gradient is lost. This can be prevented through novel adjustments to the ADAM optimization algorithm, which can improve convergence.

== Notation ==
The paper presents the following framework that generalizes training algorithms to allow us to define a specific variant such as AMSGrad or SGD entirely within it:

[[File:training_algo_framework.png|700px|center]]

Where we have <math> x_t </math> as our network parameters defined within a vector space <math> \mathcal{F} </math>. <math> \prod_{\mathcal{F}} (y) = </math> the projection of <math> y </math> on to the set <math> \mathcal{F} </math>.
<math> \psi_t </math> and <math> \phi_t </math> correspond to arbitrary functions we will provide later, The former maps from the history of gradients to <math> \mathbb{R}^d </math> and the latter maps from the history of the gradients to positive semi definite matrices. And finally <math> f_t </math> is our loss function at some time <math> t </math>, the rest should be pretty self explanatory. Using this framework and defining different <math> \psi_t </math> , <math> \phi_t </math> will allow us to recover all different kinds of training algorithms under this one roof.

=== SGD As An Example ===
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It's easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.

=== ADAM As Another Example ===
Once you can convince yourself that SGD is correct, you should understand the framework enough to see why the following setup for ADAM will allow us to recover the behaviour we want. ADAM has the ability to define a "learning rate" for every parameter based on how much that parameter moves over time (a.k.a it's momentum) supposedly to help with the learning process.

In order to do this we will choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=0}^{t} {\beta_1}^{t - i} g_t </math>, psi to be <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=0}^{t} {\beta_2}^{t - i} {g_t}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. This set up is equivalent to choosing a learning rate decay of <math>\alpha / \sqrt{\sum_i g_{i,j}}</math> for <math>j \in [d]</math>.

From this, we can now see that <math>m_t </math> gets filled up with the exponentially weighted average of the history of our gradients that we've come across so far in the algorithm. And that as we proceed to update we scale each one of our parameters by dividing out <math> V_t </math> (in the case of diagonal it's just 1/the diagonal entry) which contains the exponentially weighted average of each parameters momentum (<math> {g_t}^2 </math>) across our training so far in the algorithm. Thus giving each parameter it's own unique scaling by its second moment or momentum. Intuitively from a physical perspective if each parameter is a ball rolling around in the optimization landscape what we are now doing is instead of having the ball changed positions on the landscape at a fixed velocity (i.e. momentum of 0) the ball now has the ability to accelerate and speed up or slow down if it's on a steep hill or flat trough in the landscape (i.e. a momentum that can change with time).

= <math> \Gamma_t </math>, an Interesting Quantity =
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:

<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center>

Which essentially measure the change of the "Inverse of the learning rate" across time (since we are using alpha's as step sizes). Looking back to our example of SGD it's not hard to see that this quantity is strictly positive, which leads to "non-increasing" learning rates which, a desired property. However, that is not the case with ADAM, and can pose a problem in a theoretical and applied setting. The problem ADAM can face is that <math> \Gamma_t </math> can be indefinite, which the original proof assumed it could not be. The math for this proof is VERY long so instead we will opt for an example to showcase why this could be an issue.

Consider the loss function <math> f_t(x) = \begin{cases}
Cx & \text{for } t \text{ mod 3} = 1 \\
-x & \text{otherwise}
\end{cases} </math>

Where we have <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math>
Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math>. We then proceed to plug this into our framework from before.
This function is periodic and it's easy to see that it has the gradient of C once and then a gradient of -1 twice every period. It has an optimal solution of <math> x = -1 </math> (from a regret standpoint), but using ADAM we would eventually converge at <math> x = 1 </math> since <math> C </math> would "overpower" the -1's.

We formalize this intuition in the results below.

'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.

'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.

'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution.

= AMSGrad as an improvement to ADAM =
There is a very simple intuitive fix to ADAM to handle this problem. We simply scale our historical weighted average by the maximum we have seen so far to avoid the negative sign problem. There is a very simple one-liner adaptation of ADAM to get to AMSGRAD:
[[File:AMSGrad_algo.png|700px|center]]

Below are some simple plots comparing ADAM and AMSGrad, the first are from the paper and the second are from another individual who attempted to recreate the experiments. The two plots somewhat disagree with one another so take this heuristic improvement with a grain of salt.

[[File:AMSGrad_vs_adam.png|900px|center]]

Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge

[[File:AMSGrad_vs_adam3.png|900px|center]]

[[File:AMSGrad_vs_adam2.png|700px|center]]

= Conclusion =
We have introduced a framework for which we could view several different training algorithms. From there we used it to recover SGD as well as ADAM. In our recovery of ADAM we investigated the change of the inverse of the learning rate over time to discover in certain cases there were convergence issues. We proposed a new heuristic AMSGrad to help deal with this problem and presented some empirical results that show it may have helped ADAM slightly. Thanks for your time.

= Source =
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018

stat946w18/Implicit Causal Models for Genome-wide Association Studies

2018-03-18T22:20:28Z

Cs4li:

==Introduction and Motivation==
There is progression in probabilistic models which could develop rich generative models. The models have been expanded with neural network, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focus on capturing statistical relationships rather than causal relationships. Causal models give us a sense on how manipulate the generative process could change the final results.

Genome-wide association studies (GWAS) are examples of causal relationship. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to is single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is interested: first, predict which one or multiple SNPs cause the disease; second, target the selected SNPs to cure the disease.

[[File: gwas-example.jpg|500px|center|]]

This paper dealt with two questions. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For the working simplicity, we usually assume <math>f</math> as a linear model with a Gaussian noise. However, proof has shown that in GWAS, it is necessary to accommodate non-linearity and interactions between multiple genes into the models.

The second accomplishment of this paper is that it addressed the problem caused by latent confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor knowing the underlying structure. In this paper, they developed implicit causal models which can adjust for confounders.

There has been growing works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.

==Implicit Causal Models==
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

=== Probabilistic Causal Models ===
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider a global variable <math>\beta</math> and noise <math>\epsilon</math>, where

[[File: eq1.1.png|800px|center]]

Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，

[[File: eqt1.png|800px|center]]

The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.

[[File: f_1.png|650px|center|]]

An example of probabilistic causal models is additive noise model.

[[File: eq2.1.png|800px|center]]

<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as

[[File: eqt2.png|800px|center]]

where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.

===Implicit Causal Models===
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of an additive noise term, implicit causal models directly take noise <math>\epsilon</math> into a neural network and output <math>x</math>.

The causal diagram has changed to:

[[File: f_2.png|650px|center|]]

They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description:

[[File: theorem.png|650px|center|]]

==Implicit Causal Models with Latent Confounders==
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.

===Causal Inference with a Latent Confounder===
Same as before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.

The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,

[[File: eqt4.png|800px|center]]

The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well.

The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.

[[File: eqt5.png|800px|center]]

Note that the latent structure <math>p(z|x, y)</math> is assumed known.

===Implicit Causal Model with a Latent Confounder===
This section is the algorithm and functions to implementing an implicit causal model for GWAS.

====Generative Process of Confounders <math>z_n</math>.====
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural.

====Generative Process of SNPs <math>x_{nm}</math>.====
Given SNP is coded for,

[[File: SNP.png|300px|center]]

The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.

[[File: gpx.png|800px|center]]

A SNP matrix looks like this:
[[File: SNP_matrix.png|200px|center]]

Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,

[[File: gpxnn.png|800px|center]]

This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA.

====Generative Process of Traits <math>y_n</math>.====
Previously, each trait is modeled by a linear regression,

[[File: gpy.png|800px|center]]

This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,

[[File: gpynn.png|800px|center]]

==Likelihood-free Variational Inference==
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.

[[File: eqt5.png|800px|center]]

could be reduces to

[[File: lfvi1.png|800px|center]]

However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,

[[File: lfvi2.png|800px|center]]

==Empirical Study==
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings.
Four methods were compared:

* implicit causal model (ICM);
* PCA with linear regression (PCA);
* a linear mixed model (LMM);
* logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization.

===Simulation Study===
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration.
There are four datasets used in this simulation study:

# HapMap [Balding-Nichols model]
# 1000 Genomes Project (TGP) [PCA]
#* Human Genome Diversity project (HGDP) [PCA]
#* HGDP [Pritchard-Stephens-Donelly model]
# A latent spatial position of individuals for population structure [spatial]

The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.

[[File: table_1.png|650px|center|]]

The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.

===Real-data Analysis===
They also applied ICM to a real-world GWAS of Northern Finland Birth Cohorts which contain 324,160 SNPs and 5,027 individuals. Ten implicit causal models were fitted and the 2 neural networks both with two hidden layers were used for SNP and trait.

[[File: table_2.png|650px|center|]]

The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.

==Conclusion==
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.

By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.

The authors also believed this GWAS application is only a start of the usage of implicit causal models. It might also be used in physics or economics.

==Critique==
I think this paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feedforward 2 hidden layers neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS.

It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showed some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model whether are better than the previous methods, such as GCAT or LMM.

Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.

Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.
This could be a future work as well.

==References==
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.

Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.

Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.

stat946w18/MaskRNN: Instance Level Video Object Segmentation

2018-03-18T22:03:30Z

Cs4li:

== Introduction ==
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classify the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object Segmentation task involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation task attempts at segmenting the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences.

There are 2 different types of video object segmentation: Unsupervised and Semi-supervised. In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. In an unsupervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. In this paper we look at an unsupervised video object segmentation technique.

== Background Papers ==
Video object segmentation has been performed using spatio-temporal graphs and deep learning. The Graph based methods construct 3D spatio-temporal graphs in order to model the inter- and the intra-frame relationship of pixels or superpixels in a video.Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.

=== OSVOS (One-Shot Video Object Segmentation) ===

[[File:OSVOS.jpg | 1000px]]

This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from image classification task. This network is then converted into a fully-connected network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.

During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.

=== MaskTrack (Learning Video Object Segmentation from Static Images) ===

[[File:MaskTrack.jpg | 500px]]

MaskTrack takes the output of the previous frame to improve its predictions to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.

The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.

A parallel ConvNet network is used to generate predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using averaging operation to generate the final predicted segmented mask.

Table 1 gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.

[[File:Paper19-SegmentationComp.png]]

== Dataset ==
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.

== MaskRNN: Introduction ==
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation on each frame and using additional information (mask propagation and optical flow) from the preceding frame perform predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.

== MaskRNN: Overview ==
In a video sequence <math>I = \{I_1, I_2, …, I_T\}</math>, the sequence of <math>T</math> frames are given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for <math>N</math> salient objects.
In this paper, the problem is formulated as a time dependency problem and using a recurrent neural network, the prediction of the previous frame influences the prediction of the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is a made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.

== MaskRNN: Multiple Instance Level Segmentation ==

[[File:2ObjectSeg.jpg | 850px]]

Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN] this approach is converted into a multiple binary segmentation problem. A separate segmentation mask is predicted separately for each salient object and thus we get a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object containing the largest predicted probability.

=== MaskRNN: Binary Segmentation Network ===

[[File:MaskRNNDeepNet.jpg | 850px]]

The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The network consists of 2 networks: binary segmentation network and object localization network. The binary segmentation network is split into two streams: appearance and flow stream. The input of the appearance stream is the RGB frame at time t and the wrapped prediction of the binary segmentation mask from time <math>t-1</math>. The wrapping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the wrapped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RBG format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The output of the flow stream and the appearance stream is linearly combined and sigmoid function is applied to the result to generate binary mask for ith object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.

=== MaskRNN: Object Localization Network: ===
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.

=== Training and Finetuning ===
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.

== MaskRNN: Implementation Details ==
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation of both datasets. After both the object localization net and binary segmentation networks have trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN is only able to backpropagate the gradients back 7 frames and learn long-term temporal information.

For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames.

The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to 10-5 for online training for 200 iterations.

== MaskRNN: Experimental Results ==
=== Evaluation Metrics ===
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:

1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask.

[[File:IoU.jpg | 200px]]

2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask using bipartite matching between the bounding pixels of the masks.

[[File:Fscore.jpg | 200px]]

3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.

Temporal Stability measures how well the pixels of the two masks match, while Contour Accuracy measures the accuracy of the contours.

=== Ablation Study ===

The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.

[[File:MaskRNNTable2.jpg | 700px]]

The above table presents the contribution of each component of the network to the final prediction score. We observe that online fine-tuning improves the performance by a large margin. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net.

=== Quantitative Evaluation ===

The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.

[[File:MaskRNNTable3.jpg | 700px]]

The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain is significant by employing a Recurrent Neural Network for learning recurrence relationship and using a object localization network to improve prediction results.

The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.

[[File:MaskRNNTable4.jpg | 700px]]

=== Qualitative Evaluation ===
The authors showed example qualitative results from the DAVIS and Segtrack datasets.

Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.

[[File:maskrnn_example.png | 700px]]

Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).

[[File:maskrnn_example_fail.png | 700px]]

== Conclusion ==
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Using online fine-tuning the network is adjusted to predict better for the current video sequence.

stat946w18/MaskRNN: Instance Level Video Object Segmentation

2018-03-18T22:01:53Z

Cs4li:

== Introduction ==
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classify the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object Segmentation task involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation task attempts at segmenting the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences.

There are 2 different types of video object segmentation: Unsupervised and Semi-supervised. In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. In an unsupervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. In this paper we look at an unsupervised video object segmentation technique.

== Background Papers ==
Video object segmentation has been performed using spatio-temporal graphs and deep learning. The Graph based methods construct 3D spatio-temporal graphs in order to model the inter- and the intra-frame relationship of pixels or superpixels in a video.Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.

=== OSVOS (One-Shot Video Object Segmentation) ===

[[File:OSVOS.jpg | 1000px]]

This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from image classification task. This network is then converted into a fully-connected network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.

During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.

=== MaskTrack (Learning Video Object Segmentation from Static Images) ===

[[File:MaskTrack.jpg | 500px]]

MaskTrack takes the output of the previous frame to improve its predictions to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.

The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.

A parallel ConvNet network is used to generate predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using averaging operation to generate the final predicted segmented mask.

Table 1 gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.

[[File:Paper19-SegmentationComp.png]]

== Dataset ==
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.

== MaskRNN: Introduction ==
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation on each frame and using additional information (mask propagation and optical flow) from the preceding frame perform predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.

== MaskRNN: Overview ==
In a video sequence <math>I = \{I_1, I_2, …, I_T\}</math>, the sequence of <math>T</math> frames are given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for <math>N</math> salient objects.
In this paper, the problem is formulated as a time dependency problem and using a recurrent neural network, the prediction of the previous frame influences the prediction of the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is a made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>argmax</math> operation at test time.

== MaskRNN: Multiple Instance Level Segmentation ==

[[File:2ObjectSeg.jpg | 850px]]

Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN] this approach is converted into a multiple binary segmentation problem. A separate segmentation mask is predicted separately for each salient object and thus we get a binary segmentation problem. The binary segments are combined using an <math>argmax</math> operation where each pixel is assigned to the object containing the largest predicted probability.

=== MaskRNN: Binary Segmentation Network ===

[[File:MaskRNNDeepNet.jpg | 850px]]

The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The network consists of 2 networks: binary segmentation network and object localization network. The binary segmentation network is split into two streams: appearance and flow stream. The input of the appearance stream is the RGB frame at time t and the wrapped prediction of the binary segmentation mask from time <math>t-1</math>. The wrapping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the wrapped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RBG format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The output of the flow stream and the appearance stream is linearly combined and sigmoid function is applied to the result to generate binary mask for ith object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.

=== MaskRNN: Object Localization Network: ===
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.

=== Training and Finetuning ===
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.

== MaskRNN: Implementation Details ==
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation of both datasets. After both the object localization net and binary segmentation networks have trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN is only able to backpropagate the gradients back 7 frames and learn long-term temporal information.

For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames.

The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to 10-5 for online training for 200 iterations.

== MaskRNN: Experimental Results ==
=== Evaluation Metrics ===
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:

1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask.

[[File:IoU.jpg | 200px]]

2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask using bipartite matching between the bounding pixels of the masks.

[[File:Fscore.jpg | 200px]]

3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.

Temporal Stability measures how well the pixels of the two masks match, while Contour Accuracy measures the accuracy of the contours.

=== Ablation Study ===

The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.

[[File:MaskRNNTable2.jpg | 700px]]

The above table presents the contribution of each component of the network to the final prediction score. We observe that online fine-tuning improves the performance by a large margin. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net.

=== Quantitative Evaluation ===

The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.

[[File:MaskRNNTable3.jpg | 700px]]

The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain is significant by employing a Recurrent Neural Network for learning recurrence relationship and using a object localization network to improve prediction results.

The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.

[[File:MaskRNNTable4.jpg | 700px]]

=== Qualitative Evaluation ===
The authors showed example qualitative results from the DAVIS and Segtrack datasets.

Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.

[[File:maskrnn_example.png | 700px]]

Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).

[[File:maskrnn_example_fail.png | 700px]]

== Conclusion ==
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Using online fine-tuning the network is adjusted to predict better for the current video sequence.

A Neural Representation of Sketch Drawings

2018-03-18T05:17:03Z

Cs4li:

= Introduction =

There have been many recent advances in neural generative models for low resolution pixel-based images. Humans, however, do not see the world in a grid of pixels and more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method for creating conditional and unconditional generative models for creating these kinds of vector sketch drawings based on recurrent neural networks (RNNs). The paper explores many applications of these kinds of models, especially creative applications and makes available their unique dataset of vector images.

= Related Work =

Previous work related to sketch drawing generation includes methods that focussed primarily on converting input photographs into equivalent vector line drawings. Image generating models using neural networks also exist but focussed more on generation of pixel-based imagery. Some recent work has focussed on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has been extended somewhat recently to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modeling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Autencoders to model English sentences in latent vector space.

One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.

= Methodology =

=== Dataset ===

The “QuickDraw” dataset used in this research was assembled from 75K user drawings extracted from the game “Quick, Draw!” where users drew objects from one of hundreds of classes in 20 seconds or less. The dataset is split into 70K training samples and 2.5K validation and test samples each and represents each sketch a set of “pen stroke actions”. Each action is provided as a vector in the form <math>(\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>. For each vector, <math>\Delta x</math> and <math>\Delta y</math> give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of pen states; <math>p_{1}</math> indicates that the pen is down and a line should be drawn between the current point and the next point, <math>p_{2}</math> indicates that the pen is up and no line should be drawn between the current point and the next point, and <math>p_{3}</math> indicates that the drawing is finished and subsequent points and the current point should not be drawn.

=== Sketch-RNN ===
[[File:sketchrnn.PNG]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder model is a symmetric and parallel set of two RNNs that individually process the sketch drawings (sequence <math>S</math>) in forward and reverse order, respectively. The hidden state produced by each encoder model is then concatenated into a single hidden state <math>h</math>.

\begin{align}
h_\rightarrow = \text{encode}_\rightarrow(S), h_\leftarrow = \text{encode}_\leftarrow(S_{\text{reverse}}), h=[h_\rightarrow; h_\leftarrow]
\end{align}

The concatenated hidden state <math>h</math> is then projected into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> each of size <math>N_{z}</math> using a fully connected layer. <math>\hat{\sigma}</math> is then converted into a non-negative standard deviation parameter <math>\sigma</math> using an exponential operator. These two parameters <math>\mu</math> and <math>\sigma</math> are then used along with an IID Gaussian vector distributed as <math>\mathcal{N}(0, I)</math> of size <math>N_{z}</math> to construct a random vector <math>z \in ℝ^{N_{z}}</math>, similar to the method used for VAE:
\begin{align}
\mu = W_{\mu}h + b_{mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = exp\bigg{(}\frac{\hat{\sigma}}{2}\bigg{)}\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I)
\end{align}

The decoder model is an autoregressive RNN that samples output sketches from the latent vector <math>z</math>. The initial hidden states of each recurrent neuron are determined using <math>[h_{0}, c_{0}] = tanh(W_{z}z + b_{z})</math>. Each step of the decoder RNN accepts the previous point <math>S_{i-1}</math> and the latent vector <math>z</math> as concatenated input. The initial point given is the origin point with pen state down. The output at each step are the parameters for a probability distribution of the next point <math>S_{i}</math>. Outputs <math>\Delta x</math> and <math>\Delta y</math> are modelled using a Gaussian Mixture Model (GMM) with M normal distributions and output pen states <math>(q_{1}, q_{2}, q_{3})</math> modelled as a categorical distribution with one-hot encoding.
\begin{align}
P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1
\end{align}

For each of the M distributions in the GMM, parameters <math>\mu</math> and <math>\sigma</math> are output for both the x and y locations signifying the mean location of the next point and the standard deviation, respectively. Also output from each model is parameter <math>\rho_{xy}</math> signifying correlation of each bivariate normal distribution. An additional vector <math>\Pi</math> is an output giving the mixture weights for the GMM. The output <math>S_{i}</math> is determined from each of the mixture models using softmax sampling from these distributions.

One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete will only appear one time per each sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter <math>N_{max}</math> which basically is the length of the longest sketch in the dataset and limits the number of points per drawing to being no more than <math>N_{max}</math>, after which all output states form the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.

To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step and the model is sampled until a “stop drawing” state appears or the time state reaches time <math>N_{max}</math>. The authors also introduce a “temperature” parameter <math>\tau</math> that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:

\begin{align}
\hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau
\end{align}

This parameter <math>\tau</math> lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.

=== Unconditional Generation ===

[[File:paper15_Unconditional_Generation.png|800px|]]

The authors also explored unconditional generation of sketch drawings by only training the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing input are used as input without any conditional latent variable <math>z</math>. Different sketches are sampled from the network by only varying the temperature parameter <math>\tau</math> between 0.2 and 0.9

=== Training ===
The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss <math>L_{R}</math> and KL Divergence Loss <math>L_{KL}</math>. The reconstruction loss term is composed of two terms; <math>L_{s}</math>, which tries to maximize the log-likelihood of the generated probability distribution explaining the training data <math>S</math> and <math>L_{p}</math> which is the log loss of the pen state terms.
\begin{align}
L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}log\bigg{(}\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\bigg{)}
\end{align}
\begin{align}
L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}log(q_{k,i})
\end{align}
\begin{align}
L_{R} = L_{s} + L{p}
\end{align}

The KL divergence loss <math>L_{KL}</math> measures the difference between the latent vector <math>z</math> and an IID Gaussian distribution with 0 mean and unit variance. This term, normalized by the number of dimensions <math>N_{z}</math> is calculated as:
\begin{align}
L_{KL} = -\frac{1}{2N_{z}}\big{(}1 + \hat{\sigma} - \mu^{2} – exp(\hat{\sigma})\big{)}
\end{align}

The loss for the entire model is thus the weighted sum:
\begin{align}
Loss = L_{R} + w_{KL}L_{KL}
\end{align}

The value of the weight parameter <math>w_{KL}</math> has the effect that as <math>w_{KL} \rightarrow 0</math>, there is a loss in ability to enforce a prior over the latent space and the model assumes the form of a pure autoencoder. As with VAEs, there is a trade-off between optimizing for the two loss terms (i.e. between how precisely the model can regenerate training data <math>S</math> and how closely the latent vector <math>z</math> follows a standard normal distribution) - smaller values of <math>w_{KL}</math> lead to better <math>L_R</math> and worse <math>L_{KL}</math> compared to bigger values of <math>w_{KL}</math>. Also for unconditional generation, the model is a standalone decoder, so there will be no <math>L_{KL}</math> term as only <math>L_{R}</math> is optimized for. This tradeoff is illustrated in Figure 4 showing different settings of <math>w_{KL}</math> and the resulting <math>L_{KL}</math> and <math>L_{R}</math>, as well as just <math>L_{R}</math> in the case of unconditional generation with only a standalone decoder.

[[File:paper15_fig4.png|600px]]

=== Model Configuration ===
In the given model, the encoder and decoder RNNs consist of 512 and 2048 nodes respectively. Also, M = 20 mixture components are used for the decoder RNN. Layer Normalization is applied to the model, and during training recurrent dropout is applied with a keep probability of 90%. The model is trained with batch sizes of 100 samples, using Adam with a learning rate of 0.0001 and gradient clipping of 1.0. During training, simple data augmentation is performed by multiplying the offset columns by two IID random factors.

= Experiments =
The authors trained multiple conditional and unconditional models using varying values of <math>w_{KL}</math> and recorded the different <math>L_{R}</math> and <math>L_{KL}</math> values at convergence. The network used LSTM as it’s encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks. (A HyperLSTM consists of two coupled LSTMS: an auxiliary LSTM and a main LSTM. At every time step, the auxiliary LSTM reads the previous hidden state and the current input vector, and computes an intermediate vector <math display="inline"> z </math>. The weights of the main LSTM used in the current time step are then a learned function of this intermediate vector <math display="inline"> z </math>. That is, the weights of the main LSTM are allowed to vary between time steps as a function of the output of the auxiliary LSTM. See Ha et al. (2016) for details)

=== Conditional Reconstruction ===
[[File:conditional_generation.PNG]]

The authors qualitatively assessed the reconstructed images <math>S’</math> given input sketch <math>S</math> using different values for the temperature hyperparameter <math>\tau</math>. The figure above shows the results for different values of <math>\tau</math> starting with 0.01 at the far left and increasing to 1.0 on the far right. Interestingly, sketches with extra features like a cat with 3 eyes are reproduced as a sketch of a cat with two eyes and sketches of object of a different class such as a toothbrush are reproduced as a sketch of a cat that maintains several of the input toothbrush sketches features.

=== Latent Space Interpolation ===
[[File:latent_space_interp.PNG]]

The latent space vectors <math>z</math> have few “gaps” between encoded latent space vectors due to the enforcement of a Guassian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images in the same style as latent space arithmetic on Word2Vec vectors.

=== Sketch Drawing Analogies ===
Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head, and then adding a latent vector to the embedding layer that represents “body”. As an example, this “body” vector might be produced by taking a drawing of a pig with a body and subtracting a vector representing the pigs head.

=== Predicting Different Endings of Incomplete Sketches ===
[[File:predicting_endings.PNG]]

Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode some existing points into the hidden state of the decoder network and then generating the remaining points of the sketch with <math>\tau</math> set to 0.8.

= Applications and Future Work =
Sketch-rnn may enable the production of several creative applications. These might include suggesting ways an artist could finish a sketch, enabling artists to explore latent space arithmetic to find interesting outputs given different sketch inputs, or allowing the production of multiple different sketches of some object as a purely generative application. The authors suggest that providing some conditional sketch of an object to a model designed to produce output from a different class might be useful for producing sketches that morph the two different object classes into one sketch. For example, the image below was trained on drawing cats, but a chair was used as the input. This results in a chair looking cat.

[[File:cat-chair.png]]

Sketch-rnn may also be useful as a teaching tool to help people learn how to draw, especially if it were to be trained on higher quality images. Teaching tools might suggest to students how to proceed to finish a sketch or intake low fidelity sketches to produce a higher quality and “more coherent” output sketch.

The authors noted that sketch-rnn is not as effective at generating coherent sketches when trained on a large number of classes simultaneously (experiments mostly used datasets consisting of one or two object classes), and plan to use class information outside the latent space to try to model a greater number of classes.

Finally, the authors suggest that combining this model with another that produces photorealistic pixel-based images using sketch input, such as Pix2Pix may be an interesting direction for future research. In this case, the output from the sketch-rnn model would be used as input for Pix2Pix and could produce photorealistic images given some crude sketch from a user.

= Limitations =
The authors note a major limitation to the model is the training time relative to the number of data points. When sketches surpass 300 data points the model is difficult to train. To counteract this effect the Ramer-Douglas-Peucker algorithm was used to reduce the number of data points per sketch. This algorithm attempts to significantly reduce the number of data points while keeping the sketch as close to the original as possible.

Another limitation is the effectiveness of generating sketches as the complexity of the class increases. Below are sketches of a few classes which show how the less complex classes such as cats and crabs are more accurately generated. Frogs (more complex) tend to have overly smooth lines drawn which do not seem to be part of realistic frog samples.

[[File:paper15_classcomplexity.png]]

= Conclusion =
The authors presented sketch-rnn, a RNN model for modelling and generating vector-based sketch drawings. The VAE inspired architecture allows sampling the latent space to generate new drawings and also allows for applications that use latent space arithmetic in the style of Word2Vec to produce new drawings given operations on embedded sketch vectors. The authors also made available a large dataset of sketch drawings in the hope of encouraging more research in the area of vector-based image modelling.

= Criticisms =
The paper produces an interesting model that can effectively model vector-based images instead of traditional pixel-based images. This is an interesting problem because vector based images require producing a new way to encode the data. While the results from this paper are interesting, most of the techniques used are borrowed ideas from Variational Autoencoders and the main architecture is not terribly groundbreaking.

One novel part about the architecture presented was the way the authors used GMMs in the decoder network. While this was interesting and seemed to allow the authors to produce different outputs given the same latent vector input <math>z</math> by manipulating the <math>\tau</math> hyperparameter, it was not that clear in the article why GMMs were used instead of a more simple architecture. Much time was spent explaining basics about GMM parameters like <math>\mu</math> and <math>\sigma</math>, but there was comparatively little explanation about how points were actually sampled from these mixture models.

Finally, the authors gloss somewhat over how they were able to encode previous sketch points using only the decoder network into the hidden state of the decoder RNN to finish partially finished sketches. I can only assume that some kind of back-propagation was used to encode the expected sketch points into the hidden parameters of the decoder, but no explanation was given in the paper.

= Source =

Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).

A Neural Representation of Sketch Drawings

2018-03-18T05:15:38Z

Cs4li:

= Introduction =

There have been many recent advances in neural generative models for low resolution pixel-based images. Humans, however, do not see the world in a grid of pixels and more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method for creating conditional and unconditional generative models for creating these kinds of vector sketch drawings based on recurrent neural networks (RNNs). The paper explores many applications of these kinds of models, especially creative applications and makes available their unique dataset of vector images.

= Related Work =

Previous work related to sketch drawing generation includes methods that focussed primarily on converting input photographs into equivalent vector line drawings. Image generating models using neural networks also exist but focussed more on generation of pixel-based imagery. Some recent work has focussed on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has been extended somewhat recently to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modeling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Autencoders to model English sentences in latent vector space.

One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.

= Methodology =

=== Dataset ===

The “QuickDraw” dataset used in this research was assembled from 75K user drawings extracted from the game “Quick, Draw!” where users drew objects from one of hundreds of classes in 20 seconds or less. The dataset is split into 70K training samples and 2.5K validation and test samples each and represents each sketch a set of “pen stroke actions”. Each action is provided as a vector in the form <math>(\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>. For each vector, <math>\Delta x</math> and <math>\Delta y</math> give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of pen states; <math>p_{1}</math> indicates that the pen is down and a line should be drawn between the current point and the next point, <math>p_{2}</math> indicates that the pen is up and no line should be drawn between the current point and the next point, and <math>p_{3}</math> indicates that the drawing is finished and subsequent points and the current point should not be drawn.

=== Sketch-RNN ===
[[File:sketchrnn.PNG]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder model is a symmetric and parallel set of two RNNs that individually process the sketch drawings (sequence <math>S</math>) in forward and reverse order, respectively. The hidden state produced by each encoder model is then concatenated into a single hidden state <math>h</math>.

\begin{align}
h_\rightarrow = \text{encode}_\rightarrow(S), h_\leftarrow = \text{encode}_\leftarrow(S_{\text{reverse}}), h=[h_\rightarrow; h_\leftarrow]
\end{align}

The concatenated hidden state <math>h</math> is then projected into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> each of size <math>N_{z}</math> using a fully connected layer. <math>\hat{\sigma}</math> is then converted into a non-negative standard deviation parameter <math>\sigma</math> using an exponential operator. These two parameters <math>\mu</math> and <math>\sigma</math> are then used along with an IID Gaussian vector distributed as <math>\mathcal{N}(0, I)</math> of size <math>N_{z}</math> to construct a random vector <math>z \in ℝ^{N_{z}}</math>, similar to the method used for VAE:
\begin{align}
\mu = W_{\mu}h + b_{mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = exp\bigg{(}\frac{\hat{\sigma}}{2}\bigg{)}\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I)
\end{align}

The decoder model is an autoregressive RNN that samples output sketches from the latent vector <math>z</math>. The initial hidden states of each recurrent neuron are determined using <math>[h_{0}, c_{0}] = tanh(W_{z}z + b_{z})</math>. Each step of the decoder RNN accepts the previous point <math>S_{i-1}</math> and the latent vector <math>z</math> as concatenated input. The initial point given is the origin point with pen state down. The output at each step are the parameters for a probability distribution of the next point <math>S_{i}</math>. Outputs <math>\Delta x</math> and <math>\Delta y</math> are modelled using a Gaussian Mixture Model (GMM) with M normal distributions and output pen states <math>(q_{1}, q_{2}, q_{3})</math> modelled as a categorical distribution with one-hot encoding.
\begin{align}
P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1
\end{align}

For each of the M distributions in the GMM, parameters <math>\mu</math> and <math>\sigma</math> are output for both the x and y locations signifying the mean location of the next point and the standard deviation, respectively. Also output from each model is parameter <math>\rho_{xy}</math> signifying correlation of each bivariate normal distribution. An additional vector <math>\Pi</math> is an output giving the mixture weights for the GMM. The output <math>S_{i}</math> is determined from each of the mixture models using softmax sampling from these distributions.

One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete will only appear one time per each sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter <math>N_{max}</math> which basically is the length of the longest sketch in the dataset and limits the number of points per drawing to being no more than <math>N_{max}</math>, after which all output states form the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.

To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step and the model is sampled until a “stop drawing” state appears or the time state reaches time <math>N_{max}</math>. The authors also introduce a “temperature” parameter <math>\tau</math> that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:

\begin{align}
\hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau
\end{align}

This parameter <math>\tau</math> lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.

=== Unconditional Generation ===

[[File:paper15_Unconditional_Generation.png|800px|]]

The authors also explored unconditional generation of sketch drawings by only training the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing input are used as input without any conditional latent variable <math>z</math>. Different sketches are sampled from the network by only varying the temperature parameter <math>\tau</math> between 0.2 and 0.9

=== Training ===
The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss <math>L_{R}</math> and KL Divergence Loss <math>L_{KL}</math>. The reconstruction loss term is composed of two terms; <math>L_{s}</math>, which tries to maximize the log-likelihood of the generated probability distribution explaining the training data <math>S</math> and <math>L_{p}</math> which is the log loss of the pen state terms.
\begin{align}
L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}log\bigg{(}\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\bigg{)}
\end{align}
\begin{align}
L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}log(q_{k,i})
\end{align}
\begin{align}
L_{R} = L_{s} + L{p}
\end{align}

The KL divergence loss <math>L_{KL}</math> measures the difference between the latent vector <math>z</math> and an IID Gaussian distribution with 0 mean and unit variance. This term, normalized by the number of dimensions <math>N_{z}</math> is calculated as:
\begin{align}
L_{KL} = -\frac{1}{2N_{z}}\big{(}1 + \hat{\sigma} - \mu^{2} – exp(\hat{\sigma})\big{)}
\end{align}

The loss for the entire model is thus the weighted sum:
\begin{align}
Loss = L_{R} + w_{KL}L_{KL}
\end{align}

The value of the weight parameter <math>w_{KL}</math> has the effect that as <math>w_{KL} \rightarrow 0</math>, there is a loss in ability to enforce a prior over the latent space and the model assumes the form of a pure autoencoder. As with VAEs, there is a trade-off between optimizing for the two loss terms (i.e. between how precisely the model can regenerate training data <math>S</math> and how closely the latent vector <math>z</math> follows a standard normal distribution) - smaller values of <math>w_{KL}</math> lead to better <math>L_R</math> and worse <math>L_{KL}</math> compared to bigger values of <math>w_{KL}</math>. Also for unconditional generation, the model is a standalone decoder, so there will be no <math>L_{KL}</math> term as only <math>L_{R}</math> is optimized for. This tradeoff is illustrated in Figure 4 showing different settings of <math>w_{KL}</math> and the resulting <math>L_{KL} and <math>L_{R}</math>, as well as just <math>L_{R}</math> in the case of unconditional generation with only a standalone decoder.

=== Model Configuration ===
In the given model, the encoder and decoder RNNs consist of 512 and 2048 nodes respectively. Also, M = 20 mixture components are used for the decoder RNN. Layer Normalization is applied to the model, and during training recurrent dropout is applied with a keep probability of 90%. The model is trained with batch sizes of 100 samples, using Adam with a learning rate of 0.0001 and gradient clipping of 1.0. During training, simple data augmentation is performed by multiplying the offset columns by two IID random factors.

= Experiments =
The authors trained multiple conditional and unconditional models using varying values of <math>w_{KL}</math> and recorded the different <math>L_{R}</math> and <math>L_{KL}</math> values at convergence. The network used LSTM as it’s encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks. (A HyperLSTM consists of two coupled LSTMS: an auxiliary LSTM and a main LSTM. At every time step, the auxiliary LSTM reads the previous hidden state and the current input vector, and computes an intermediate vector <math display="inline"> z </math>. The weights of the main LSTM used in the current time step are then a learned function of this intermediate vector <math display="inline"> z </math>. That is, the weights of the main LSTM are allowed to vary between time steps as a function of the output of the auxiliary LSTM. See Ha et al. (2016) for details)

=== Conditional Reconstruction ===
[[File:conditional_generation.PNG]]

The authors qualitatively assessed the reconstructed images <math>S’</math> given input sketch <math>S</math> using different values for the temperature hyperparameter <math>\tau</math>. The figure above shows the results for different values of <math>\tau</math> starting with 0.01 at the far left and increasing to 1.0 on the far right. Interestingly, sketches with extra features like a cat with 3 eyes are reproduced as a sketch of a cat with two eyes and sketches of object of a different class such as a toothbrush are reproduced as a sketch of a cat that maintains several of the input toothbrush sketches features.

=== Latent Space Interpolation ===
[[File:latent_space_interp.PNG]]

The latent space vectors <math>z</math> have few “gaps” between encoded latent space vectors due to the enforcement of a Guassian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images in the same style as latent space arithmetic on Word2Vec vectors.

=== Sketch Drawing Analogies ===
Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head, and then adding a latent vector to the embedding layer that represents “body”. As an example, this “body” vector might be produced by taking a drawing of a pig with a body and subtracting a vector representing the pigs head.

=== Predicting Different Endings of Incomplete Sketches ===
[[File:predicting_endings.PNG]]

Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode some existing points into the hidden state of the decoder network and then generating the remaining points of the sketch with <math>\tau</math> set to 0.8.

= Applications and Future Work =
Sketch-rnn may enable the production of several creative applications. These might include suggesting ways an artist could finish a sketch, enabling artists to explore latent space arithmetic to find interesting outputs given different sketch inputs, or allowing the production of multiple different sketches of some object as a purely generative application. The authors suggest that providing some conditional sketch of an object to a model designed to produce output from a different class might be useful for producing sketches that morph the two different object classes into one sketch. For example, the image below was trained on drawing cats, but a chair was used as the input. This results in a chair looking cat.

[[File:cat-chair.png]]

Sketch-rnn may also be useful as a teaching tool to help people learn how to draw, especially if it were to be trained on higher quality images. Teaching tools might suggest to students how to proceed to finish a sketch or intake low fidelity sketches to produce a higher quality and “more coherent” output sketch.

The authors noted that sketch-rnn is not as effective at generating coherent sketches when trained on a large number of classes simultaneously (experiments mostly used datasets consisting of one or two object classes), and plan to use class information outside the latent space to try to model a greater number of classes.

Finally, the authors suggest that combining this model with another that produces photorealistic pixel-based images using sketch input, such as Pix2Pix may be an interesting direction for future research. In this case, the output from the sketch-rnn model would be used as input for Pix2Pix and could produce photorealistic images given some crude sketch from a user.

= Limitations =
The authors note a major limitation to the model is the training time relative to the number of data points. When sketches surpass 300 data points the model is difficult to train. To counteract this effect the Ramer-Douglas-Peucker algorithm was used to reduce the number of data points per sketch. This algorithm attempts to significantly reduce the number of data points while keeping the sketch as close to the original as possible.

Another limitation is the effectiveness of generating sketches as the complexity of the class increases. Below are sketches of a few classes which show how the less complex classes such as cats and crabs are more accurately generated. Frogs (more complex) tend to have overly smooth lines drawn which do not seem to be part of realistic frog samples.

[[File:paper15_classcomplexity.png]]

= Conclusion =
The authors presented sketch-rnn, a RNN model for modelling and generating vector-based sketch drawings. The VAE inspired architecture allows sampling the latent space to generate new drawings and also allows for applications that use latent space arithmetic in the style of Word2Vec to produce new drawings given operations on embedded sketch vectors. The authors also made available a large dataset of sketch drawings in the hope of encouraging more research in the area of vector-based image modelling.

= Criticisms =
The paper produces an interesting model that can effectively model vector-based images instead of traditional pixel-based images. This is an interesting problem because vector based images require producing a new way to encode the data. While the results from this paper are interesting, most of the techniques used are borrowed ideas from Variational Autoencoders and the main architecture is not terribly groundbreaking.

One novel part about the architecture presented was the way the authors used GMMs in the decoder network. While this was interesting and seemed to allow the authors to produce different outputs given the same latent vector input <math>z</math> by manipulating the <math>\tau</math> hyperparameter, it was not that clear in the article why GMMs were used instead of a more simple architecture. Much time was spent explaining basics about GMM parameters like <math>\mu</math> and <math>\sigma</math>, but there was comparatively little explanation about how points were actually sampled from these mixture models.

Finally, the authors gloss somewhat over how they were able to encode previous sketch points using only the decoder network into the hidden state of the decoder RNN to finish partially finished sketches. I can only assume that some kind of back-propagation was used to encode the expected sketch points into the hidden parameters of the decoder, but no explanation was given in the paper.

= Source =

Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).

File:paper15 fig4.png

2018-03-18T05:10:06Z

Cs4li:

A Neural Representation of Sketch Drawings

2018-03-18T05:07:16Z

Cs4li:

= Introduction =

There have been many recent advances in neural generative models for low resolution pixel-based images. Humans, however, do not see the world in a grid of pixels and more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method for creating conditional and unconditional generative models for creating these kinds of vector sketch drawings based on recurrent neural networks (RNNs). The paper explores many applications of these kinds of models, especially creative applications and makes available their unique dataset of vector images.

= Related Work =

Previous work related to sketch drawing generation includes methods that focussed primarily on converting input photographs into equivalent vector line drawings. Image generating models using neural networks also exist but focussed more on generation of pixel-based imagery. Some recent work has focussed on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has been extended somewhat recently to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modeling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Autencoders to model English sentences in latent vector space.

One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.

= Methodology =

=== Dataset ===

The “QuickDraw” dataset used in this research was assembled from 75K user drawings extracted from the game “Quick, Draw!” where users drew objects from one of hundreds of classes in 20 seconds or less. The dataset is split into 70K training samples and 2.5K validation and test samples each and represents each sketch a set of “pen stroke actions”. Each action is provided as a vector in the form <math>(\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>. For each vector, <math>\Delta x</math> and <math>\Delta y</math> give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of pen states; <math>p_{1}</math> indicates that the pen is down and a line should be drawn between the current point and the next point, <math>p_{2}</math> indicates that the pen is up and no line should be drawn between the current point and the next point, and <math>p_{3}</math> indicates that the drawing is finished and subsequent points and the current point should not be drawn.

=== Sketch-RNN ===
[[File:sketchrnn.PNG]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder model is a symmetric and parallel set of two RNNs that individually process the sketch drawings (sequence <math>S</math>) in forward and reverse order, respectively. The hidden state produced by each encoder model is then concatenated into a single hidden state <math>h</math>.

\begin{align}
h_\rightarrow = \text{encode}_\rightarrow(S), h_\leftarrow = \text{encode}_\leftarrow(S_{\text{reverse}}), h=[h_\rightarrow; h_\leftarrow]
\end{align}

The concatenated hidden state <math>h</math> is then projected into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> each of size <math>N_{z}</math> using a fully connected layer. <math>\hat{\sigma}</math> is then converted into a non-negative standard deviation parameter <math>\sigma</math> using an exponential operator. These two parameters <math>\mu</math> and <math>\sigma</math> are then used along with an IID Gaussian vector distributed as <math>\mathcal{N}(0, I)</math> of size <math>N_{z}</math> to construct a random vector <math>z \in ℝ^{N_{z}}</math>, similar to the method used for VAE:
\begin{align}
\mu = W_{\mu}h + b_{mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = exp\bigg{(}\frac{\hat{\sigma}}{2}\bigg{)}\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I)
\end{align}

The decoder model is an autoregressive RNN that samples output sketches from the latent vector <math>z</math>. The initial hidden states of each recurrent neuron are determined using <math>[h_{0}, c_{0}] = tanh(W_{z}z + b_{z})</math>. Each step of the decoder RNN accepts the previous point <math>S_{i-1}</math> and the latent vector <math>z</math> as concatenated input. The initial point given is the origin point with pen state down. The output at each step are the parameters for a probability distribution of the next point <math>S_{i}</math>. Outputs <math>\Delta x</math> and <math>\Delta y</math> are modelled using a Gaussian Mixture Model (GMM) with M normal distributions and output pen states <math>(q_{1}, q_{2}, q_{3})</math> modelled as a categorical distribution with one-hot encoding.
\begin{align}
P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1
\end{align}

For each of the M distributions in the GMM, parameters <math>\mu</math> and <math>\sigma</math> are output for both the x and y locations signifying the mean location of the next point and the standard deviation, respectively. Also output from each model is parameter <math>\rho_{xy}</math> signifying correlation of each bivariate normal distribution. An additional vector <math>\Pi</math> is an output giving the mixture weights for the GMM. The output <math>S_{i}</math> is determined from each of the mixture models using softmax sampling from these distributions.

One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete will only appear one time per each sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter <math>N_{max}</math> which basically is the length of the longest sketch in the dataset and limits the number of points per drawing to being no more than <math>N_{max}</math>, after which all output states form the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.

To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step and the model is sampled until a “stop drawing” state appears or the time state reaches time <math>N_{max}</math>. The authors also introduce a “temperature” parameter <math>\tau</math> that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:

\begin{align}
\hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau
\end{align}

This parameter <math>\tau</math> lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.

=== Unconditional Generation ===

[[File:paper15_Unconditional_Generation.png|800px|]]

The authors also explored unconditional generation of sketch drawings by only training the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing input are used as input without any conditional latent variable <math>z</math>. Different sketches are sampled from the network by only varying the temperature parameter <math>\tau</math> between 0.2 and 0.9

=== Training ===
The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss <math>L_{R}</math> and KL Divergence Loss <math>L_{KL}</math>. The reconstruction loss term is composed of two terms; <math>L_{s}</math>, which tries to maximize the log-likelihood of the generated probability distribution explaining the training data <math>S</math> and <math>L_{p}</math> which is the log loss of the pen state terms.
\begin{align}
L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}log\bigg{(}\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\bigg{)}
\end{align}
\begin{align}
L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}log(q_{k,i})
\end{align}
\begin{align}
L_{R} = L_{s} + L{p}
\end{align}

The KL divergence loss <math>L_{KL}</math> measures the difference between the latent vector <math>z</math> and an IID Gaussian distribution with 0 mean and unit variance. This term, normalized by the number of dimensions <math>N_{z}</math> is calculated as:
\begin{align}
L_{KL} = -\frac{1}{2N_{z}}\big{(}1 + \hat{\sigma} - \mu^{2} – exp(\hat{\sigma})\big{)}
\end{align}

The loss for the entire model is thus the weighted sum:
\begin{align}
Loss = L_{R} + w_{KL}L_{KL}
\end{align}

The value of the weight parameter <math>w_{KL}</math> has the effect that as <math>w_{KL} \rightarrow 0</math>, there is a loss in ability to enforce a prior over the latent space and the model assumes the form of a pure autoencoder. As with VAEs, there is a trade-off between optimizing for the two loss terms (i.e. between how precisely the model can regenerate training data <math>S</math> and how closely the latent vector <math>z</math> follows a standard normal distribution) - smaller values of <math>w_{KL}</math> lead to better <math>L_R</math> and worse <math>L_{KL}</math> compared to bigger values of <math>w_{KL}</math>.

=== Model Configuration ===
In the given model, the encoder and decoder RNNs consist of 512 and 2048 nodes respectively. Also, M = 20 mixture components are used for the decoder RNN. Layer Normalization is applied to the model, and during training recurrent dropout is applied with a keep probability of 90%. The model is trained with batch sizes of 100 samples, using Adam with a learning rate of 0.0001 and gradient clipping of 1.0. During training, simple data augmentation is performed by multiplying the offset columns by two IID random factors.

= Experiments =
The authors trained multiple conditional and unconditional models using varying values of <math>w_{KL}</math> and recorded the different <math>L_{R}</math> and <math>L_{KL}</math> values at convergence. The network used LSTM as it’s encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks. (A HyperLSTM consists of two coupled LSTMS: an auxiliary LSTM and a main LSTM. At every time step, the auxiliary LSTM reads the previous hidden state and the current input vector, and computes an intermediate vector <math display="inline"> z </math>. The weights of the main LSTM used in the current time step are then a learned function of this intermediate vector <math display="inline"> z </math>. That is, the weights of the main LSTM are allowed to vary between time steps as a function of the output of the auxiliary LSTM. See Ha et al. (2016) for details)

=== Conditional Reconstruction ===
[[File:conditional_generation.PNG]]

The authors qualitatively assessed the reconstructed images <math>S’</math> given input sketch <math>S</math> using different values for the temperature hyperparameter <math>\tau</math>. The figure above shows the results for different values of <math>\tau</math> starting with 0.01 at the far left and increasing to 1.0 on the far right. Interestingly, sketches with extra features like a cat with 3 eyes are reproduced as a sketch of a cat with two eyes and sketches of object of a different class such as a toothbrush are reproduced as a sketch of a cat that maintains several of the input toothbrush sketches features.

=== Latent Space Interpolation ===
[[File:latent_space_interp.PNG]]

The latent space vectors <math>z</math> have few “gaps” between encoded latent space vectors due to the enforcement of a Guassian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images in the same style as latent space arithmetic on Word2Vec vectors.

=== Sketch Drawing Analogies ===
Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head, and then adding a latent vector to the embedding layer that represents “body”. As an example, this “body” vector might be produced by taking a drawing of a pig with a body and subtracting a vector representing the pigs head.

=== Predicting Different Endings of Incomplete Sketches ===
[[File:predicting_endings.PNG]]

Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode some existing points into the hidden state of the decoder network and then generating the remaining points of the sketch with <math>\tau</math> set to 0.8.

= Applications and Future Work =
Sketch-rnn may enable the production of several creative applications. These might include suggesting ways an artist could finish a sketch, enabling artists to explore latent space arithmetic to find interesting outputs given different sketch inputs, or allowing the production of multiple different sketches of some object as a purely generative application. The authors suggest that providing some conditional sketch of an object to a model designed to produce output from a different class might be useful for producing sketches that morph the two different object classes into one sketch. For example, the image below was trained on drawing cats, but a chair was used as the input. This results in a chair looking cat.

[[File:cat-chair.png]]

Sketch-rnn may also be useful as a teaching tool to help people learn how to draw, especially if it were to be trained on higher quality images. Teaching tools might suggest to students how to proceed to finish a sketch or intake low fidelity sketches to produce a higher quality and “more coherent” output sketch.

The authors noted that sketch-rnn is not as effective at generating coherent sketches when trained on a large number of classes simultaneously (experiments mostly used datasets consisting of one or two object classes), and plan to use class information outside the latent space to try to model a greater number of classes.

Finally, the authors suggest that combining this model with another that produces photorealistic pixel-based images using sketch input, such as Pix2Pix may be an interesting direction for future research. In this case, the output from the sketch-rnn model would be used as input for Pix2Pix and could produce photorealistic images given some crude sketch from a user.

= Limitations =
The authors note a major limitation to the model is the training time relative to the number of data points. When sketches surpass 300 data points the model is difficult to train. To counteract this effect the Ramer-Douglas-Peucker algorithm was used to reduce the number of data points per sketch. This algorithm attempts to significantly reduce the number of data points while keeping the sketch as close to the original as possible.

Another limitation is the effectiveness of generating sketches as the complexity of the class increases. Below are sketches of a few classes which show how the less complex classes such as cats and crabs are more accurately generated. Frogs (more complex) tend to have overly smooth lines drawn which do not seem to be part of realistic frog samples.

[[File:paper15_classcomplexity.png]]

= Conclusion =
The authors presented sketch-rnn, a RNN model for modelling and generating vector-based sketch drawings. The VAE inspired architecture allows sampling the latent space to generate new drawings and also allows for applications that use latent space arithmetic in the style of Word2Vec to produce new drawings given operations on embedded sketch vectors. The authors also made available a large dataset of sketch drawings in the hope of encouraging more research in the area of vector-based image modelling.

= Criticisms =
The paper produces an interesting model that can effectively model vector-based images instead of traditional pixel-based images. This is an interesting problem because vector based images require producing a new way to encode the data. While the results from this paper are interesting, most of the techniques used are borrowed ideas from Variational Autoencoders and the main architecture is not terribly groundbreaking.

One novel part about the architecture presented was the way the authors used GMMs in the decoder network. While this was interesting and seemed to allow the authors to produce different outputs given the same latent vector input <math>z</math> by manipulating the <math>\tau</math> hyperparameter, it was not that clear in the article why GMMs were used instead of a more simple architecture. Much time was spent explaining basics about GMM parameters like <math>\mu</math> and <math>\sigma</math>, but there was comparatively little explanation about how points were actually sampled from these mixture models.

Finally, the authors gloss somewhat over how they were able to encode previous sketch points using only the decoder network into the hidden state of the decoder RNN to finish partially finished sketches. I can only assume that some kind of back-propagation was used to encode the expected sketch points into the hidden parameters of the decoder, but no explanation was given in the paper.

= Source =

Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).

stat946w18/Synthetic and natural noise both break neural machine translation

2018-03-18T04:49:44Z

Cs4li:

== Introduction ==
* Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.

* A person's ability to read this text comes as no surprise to the Psychology literature
*# Saberi & Perrott (1999) found that this robustness extends to audio as well.
*# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.
*# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.
*# Mayall et al (1997) shows that we rely on word shape.
*# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension

However, neural machine translation (NMT) systems are brittle. i.e. The Arabic word
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter.

Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.

Figure 1 shows the performance translating German to English as a function of the percent of German words modified. Here we show two types of noise: (1) Random permutation of the word and (2) Swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.

[[File:BLEU_plot.PNG]]

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1.

This paper explores two simple strategies for increasing model robustness:
# using structure-invariant representations (character CNN representation)
# robust training on noisy data, a form of adversarial training.

The goal of the paper is two-fold:
# to initiate a conversation on robust training and modeling techniques in NMT
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks

== Adversarial examples ==
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with
access to the model parameters, and black-box attacks, where examples are generated without such access.

The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.

== MT system ==
We experiment with three different NMT systems with access to character information at different levels.
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.

== Data ==
=== MT Data ===
We use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.

[[File:Table1x.PNG]]

=== Natural and Artificial Noise ===
==== Natural Noise ====
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:

# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils

The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.

They insert these errors into the source-side of the parallel data by replacing every word in the corpus with an error if one exists in our dataset. When there is more than one possible replacement to choose, words for which there is no error, are sampled uniformly and kept as is.

==== Synthetic Noise ====
In addition to naturally collected sources of error, we also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo.
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).
# <code>Fully Random</code> Completely randomized words.
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key

[[File:Table3x.PNG]]

Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the
translation quality, with random scrambling producing the lowest BLEU scores.

In contrast to the poor performance of these methods in the presence of noise, humans can perform very well as mentioned in the introduction. The table below shows the translations performed by a German native-speaker human, not familiar with the meme and three machine translation methods. Clearly, the machine translation methods failed.

[[File:paper16_tab4.png]]

The author also examined improvements by using a simple spell checker. The author tried correcting error through Google's spell checker by simply accepting the first suggestion on the detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The author concluded using existing spell checkers would not improve the accuracy to be comparable with vanilla text. The results are shown in the table below.

[[File:paper16_tab5.png]]

== Dealing with noise ==
=== Structure Invariant Representations ===
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.

The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.

[[File:Table5x.PNG]]

<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard errors and Natural errors.

=== Black-Box Adversarial Training ===

<code>charCNN</code> Performance
[[File:Table6x.PNG]]

Here is the result of the translation of the scrambled meme:
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”

== Analysis ==
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===

As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances.

As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close uniform weights, hence the consistently lower variance measures.

[[File:Table7x.PNG]]

== Conclusion ==
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to
address multiple types of errors that are seen in training.
For the future work, the author suggested generating more realistic synthetic noise by using phonetic and syntactic structure. Also, they suggested that a better NMT architecture could be designed which can be robust to noise without seeing it in the training data.

== Criticism ==
A major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation which has been mostly left untouched. Also these solutions/models still do not tackle the problem of natural noise as the models trained on the synthetic noise don't generalize well to natural noise.

== References ==
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2017.
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.
# Aurlien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedias Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, may 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.
# Katrin Wisniewski, Karin Schne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.

File:paper16 tab5.png

2018-03-18T04:48:29Z

Cs4li:

File:paper16 tab4.png

2018-03-18T04:35:44Z

Cs4li:

Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments

2018-03-18T02:19:56Z

Cs4li:

= Introduction =

Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.

The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:

# A task is sampled. The task is "Is X a dog?"
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.

This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.

In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes. This paper focuses on a multi-agent Non-stationary environment which requires Reinforcement Learning(RL) agents to do continuous adaptation in such an environment. Nonstationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.

[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) illustrates a probablistic model for Model Agnostic Meta-Learning(MAML) in a multi-task RL setting, where the taskes <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML by the author suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the enviroment. The distribution of taskes is represented by a Markov chain, policies from a previous step and used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using trancated backprogation through time starting from <math>L_{T_{i+1}}</math>]]

= Model Agnostic Meta-Learning =

An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):

"In our approach, the parameters of
the model are explicitly trained such that a small
number of gradient steps with a small amount
of training data from a new task will produce
good generalization performance on that task" (Finn et al, 2017).

[[File:MAML.png | 500px]]

In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.

One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)

= Probabilistic Framework for Meta-Learning =

This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:

* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)
* <math>D(T)</math> : A distribution over tasks.
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math>
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.

The papers goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:

<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>,

where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.

The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.

[[File:MAML2.png | 800px]]

The mathematics of calculating loss gradients is omitted.

= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =

The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.
Putting this example into the above probabilistic framework yields:

* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math>
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.
* A Markov Descision process composed of
** Observations <math> x_t </math> containing information about the state of the spider.
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.

Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.

At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.

[[File:locomotion_results.png | 800px]]

= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =

The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.

== Setup ==
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.

* <math>T_i</math>: The task of fighting a round.
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.
* A Markov Descision process composed of
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.

Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.
* For staying close to the center of the ring.
* For exerting force on the opponents body.
* For moving towards the opponent.
* For the distance of the opponent to the center of the ring.

This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.

== Training ==
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.

[[File:weight_cal.png | 800px]]

We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.

[[File:robosumo_results.png | 800px]]

The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.

= Future Work =
The authors mention that their approach will likely not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.

= Sources =
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.

File:paper19 fig1.png

2018-03-18T01:27:40Z

Cs4li:

stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT

2018-03-16T05:06:09Z

Cs4li:

== Introduction ==
Generative Adversarial Networks (GANs) are powerful generative models. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data having a distribution matched with the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distribution distance between the generated data and the real one is important to train the generator.

Optimal transport theory evaluates the distribution distance based on metric, which provides another method for generator training. The main advantage of optimal transport theory over the distance measurement in GAN is its closed form solution for having a tractable training process. But the theory might also result in inconsistency in statistical estimation due to the given biased gradients if the mini-batches method is applied (Bellemare et al.,
2017).

This paper presents a variant GANs named OT-GAN, which incorporates a discriminative metric called 'MIni-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.

== GANs and Optimal Transport ==

===Generative Adversarial Nets===
Original GAN was firstly reviewed. The objective function of the GAN:

[[File:equation1.png|700px]]

The goal of GANs is to train the generator g and the discriminator d finding a pair of (g,d) to achieve Nash equilibrium. However, it could cause failure of converging since the generator and the discriminator are trained based on gradient descent techniques.

===Wasserstein Distance (Earth-Mover Distance)===

In order to solve the problem of convergence failure, Arjovsky et. al. (2017) suggested Wasserstein distance (Earth-Mover distance) based on the optimal transport theory.

[[File:equation2.png|600px]]

where <math> \prod (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data), <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function and the Euclidean distance was used by Arjovsky et. al. in the paper.

The Wasserstein distance can be considered as moving the minimum amount of points between distribution <math> g(y) </math> and <math> p(x) </math> such that the generator distribution <math> g(y) </math> is similar to the real data distribution <math> p(x) </math>.

Computing the Wasserstein distance is intractable. The proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into Kantorovich-Rubinstein dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.

[[File:equation3.png|600px]]

W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.

===Sinklhorn Distance===
Genevay et al. (2017) proposed to use the primal formulation of optimal transport instead of the dual formulation to generative modeling. They introduced Sinkhorn distance which is a smoothed generalization of the Wasserstein distance.
[[File: equation4.png|600px]]

It introduced entropy restriction (<math> \beta </math>) to the joint distribution <math> \prod_{\beta} (p,g) </math>. This distance could be generalized to approximate the mini-batches of data <math> X ,Y</math> with <math> K </math> vectors of <math> x, y</math>. The <math> i, j </math> th entry of the cost matrix <math> C </math> can be interpreted as the cost it needs to transport the <math> x_i </math> in mini-batch X to the <math> y_i </math> in mini-batch <math>Y </math>. The resulting distance will be:

[[File: equation5.png|550px]]

where <math> M </math> is a <math> K \times K </math> matrix, each row of <math> M </math> is a joint distribution of <math> \gamma (x,y) </math> with positive entries. The summmation of rows or columns of <math> M </math> is always equal to 1.

This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distribution when taking the expectation of <math> \mathcal{W}_{c} </math> and the gradients are biased when the mini-batch size is fixed.

===Energy Distance (Cramer Distance)===
In order to solve the above problem, Bellemare et al. proposed Energy distance:

[[File: equation6.png|700px]]

where <math> x, x' </math> and <math> y, y'</math> are independent samples from data distribution <math> p </math> and generator distribution <math> g </math>, respectively. Based on the Energy distance, Cramer GAN is to minimize the ED distance metric when training the generator.

==MINI-BATCH ENERGY DISTANCE==
Salimans et al. (2016) mentioned that comparing to use distributions over individual images, mini-batch GAN is more powerful when use the distributions over mini-batches <math> g(X), p(X) </math>. The distance measure is generated for mini-batches.

===GENERALIZED ENERGY DISTANCE===
The generalized energy distance allowed to use non-Euclidean distance functions d. It is also valid for mini-batches and is considered better than working with individual data batch.

[[File: equation7.png|670px]]

Similarly as defined in the Energy distance, <math> X, X' </math> and <math> Y, Y'</math> can be the independent samples from data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. While in Generalized engergy distance, <math> X, X' </math> and <math> Y, Y'</math> can also be valid for mini-batches. The <math> D_{GED}(p,g) </math> is a metric when having <math> d </math> as a metric. Thus, taking the triangle inequality of <math> d </math> into account, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.

===MINI-BATCH ENERGY DISTANCE===
As <math> d </math> is free to choose, authors proposed Mini-batch Energy Distance by using entropy-regularized Wasserstein distnace as <math> d </math>.

[[File: equation8.png|650px]]

where <math> X, X' </math> and <math> Y, Y'</math> are independent sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with primal form of optimal tranport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between each mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> to equation (5) and using enregy distance, the objective becomes statistically consistent and mini-batch gradients are unbiased.

==OPTIMAL TRANSPORT GAN (OT-GAN)==

In order to secure the statistical efficiency, authors suggested using cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math> based on the deep neural network that maps the mini-batch data to a learned latent space. The reason for not using Euclidean distance is because of its poor performance in the high dimensional space. Here is the transportation cost:

[[File: euqation9.png|370px]]

where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.

Unlike the practice when using the original GANs, the generator was trained more often than the critic, which keep the cost function from degeneration. The resulting generator in OT-GAN has a well defined and statistically consistent objective through the training process.

The algorithm is defined below. The backpropagation is not used in the algorithm due to the envelope theorem. Stochastic gradient descent is used as the optimization method.

[[File: al.png|600px]]

[[File: al_figure.png|600px]]

==EXPERIMENTS==

In order to demonstrate the supermum performance of the OT-GAN, authors compared it with the original GAN and other popular models based on four experiments: Dataset recovery; CIFAR-10 test; ImageNet test; and the conditional image synthesis test.

===MIXTURE OF GAUSSIAN DATASET===
OT-GAN has a statistically consistent objective when it is compared with the original GAN (DC-GAN), such that the generator would not update to a wrong direction even if the signal provided by the cost function to the generator is not good. In order to prove this advantage, authors compared the OT-GAN with the original GAN loss (DAN-S) based on a simple task. The task was set to recover all of the 8 modes from 8 Gaussian mixers in which the means were arranged in a circle. MLP with RLU activation functions were used in this task. The critic was only updated for 15K iterations. The generator distribution was tracked for another 25K iteration. The results showed that the original GAN experiences the model collapse after fixing the discriminator while the OT-GAN recovered all the 8 modes from the mixed Gaussian data.

[[File: 5_1.png|600px]]

===CIFAR-10===

The dataset CIFAR-10 was then used for inspecting the effect of batch-size to the model training process and the image quality. OT-GAN and four other methods were compared using "inception score" as the criteria for comparison. Figure 3 shows the change of inceptions scores (y-axis) by the increased of the iteration number. Scores of four different batch sizes (200, 800, 3200 and 8000) were compared. The results show that a larger batch size would lead to a more stable model showing a larger value in inception score. However, a large batch size would also require a high-performance computational environment. The sample quality across all 5 methods are compared in Table 1 where the OT_GAN has the best score.

[[File: 5_2.png|600px]]

===IMAGENET DOGS===

In order to investigate the performance of OT-GAN when dealing with the high-quality images, the dog subset of ImageNet (128*128) was used to train the model. Figure 6 shows that OT-GAN produces less nonsensical images and it has a higher inception score compare to the DC-GAN.

[[FIle: 5_3.png|600px]]

===CONDITIONAL GENERATION OF BIRDS===

The last experiment was to compare OT-GAN with three popular GAN models for processing the text-to-image generation demonstrating the performance on conditional image synthesis. As can be found from Table 2, OT-GAN received the highest inception score than the scores of the other three models.

[[File: 5_4.png|600px]]

The algorithm used to obtain the results above is conditional generation generalized from '''Algorithm 1''' to include conditional information <math>s</math> such as some text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.

[[File: paper23_alg2.png|600px]]

==CONCLUSION==

In this paper, an OT-GAN method was proposed based on the optimal transport theory. A distance metric that combines the primal form of the optimal transport and the energy distance was given was presented for realizing the OT-GAN. One of the advantages of OT-GAN over other GAN models is that OT-GAN can stay on the correct track with an unbiased gradient even if the training on critic is stopped or presents a weak cost signal. The performance of the OT-GAN can be maintained when the batch size is increasing, though the computational cost has to be taken into consideration.

==CRITIQUE==

The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. The stability was demonstrated based on the four experiments that comparing OP-GAN with other popular methods. However, limitations in computational efficiency was not discussed much. Furthermore, in section 2, the paper is lack of explanation on using mini-batches instead of a vector as input when applying Sinkhorn distance. It is also confusing when explaining the algorithm in section 4 about choosing M for minimizing <math> \mathcal{W}_c </math>. Lastly, it is found that it is lack of parallel comparison with existing GAN variants in this paper. Readers may feel jumping from one algorithm to another without necessary explanations.

==Reference==
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).

File:paper23 alg2.png

2018-03-16T05:00:00Z

Cs4li:

Wavelet Pooling CNN

2018-03-16T04:49:58Z

Cs4li:

== Introduction ==
It is generally the case that Convolution Neural Networks (CNNs) out perform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates to be innovated in order to improve said performance. The pooling layer is one of these fundamentals, and although various methods exist ranging from deterministic and simple: max pooling and average pooling, to probabilistic: mixed pooling and stochastic pooling, all these methods employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).

This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation, uses a sub-band method that the authors claim produces less artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.

== Pooling Background ==
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data is reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = max_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^th</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.

[[File:WT_Fig1.PNG|650px|center|]]

The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.

[[File:WT_Fig2.PNG|650px|center|]]

== Wavelet Background ==
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well.

Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting.

[[File:WT_Fig3.jpg|650px|center|]]

The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.

== Discrete Wavelet Transform General==
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.

[[File:WT_Fig8.png|650px|center|]]

[[File:WT_Fig9.png|650px|center|]]

== DWT example using Haar Wavelet ==
Suppose we have an image represented by the following pixels:
<math> \begin{bmatrix}
100 & 50 & 60 & 150 \\
20 & 60 & 40 & 30 \\
50 & 90 & 70 & 82 \\
74 & 66 & 90 & 58 \\
\end{bmatrix} </math>

For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row
* a1 = (i1 + i2)/2
* a2 = (i3 + i4)/2
* d1 = (i1 - i2)/2
* d2 = (i3 - i4)/2

After the row transforms, the images looks as follows:
<math> \begin{bmatrix}
75 & 105 & 25 & -45 \\
40 & 35 & -20 & 5 \\
70 & 76 & -20 & -6 \\
70 & 74 & 4 & 16 \\
\end{bmatrix} </math>

Now we apply the same method to the columns in the exact same way.

== Proposed Method ==
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create less artifacts that may affect the image classification.
=== Forward Propagation ===
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k <= 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\psi[j,n]|_{n = 2k, k <= 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math>, <math>W_\psi</math>, are approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are time reversed scaling and wavelet vectors, <math>(n)</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply to images, FWT is first applied on the rows and then the columns. If a low (L) and high(H) sub-band is extracted from the rows and similarly for the columns than at each level there is 4 sub-bands (LH, HL, HH, and LL) where LL will further be decomposed into the level 2 decomposition.

Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k <= 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.

[[File:WT_Fig6.PNG|650px|center|]]

=== Back Propagation ===
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.

[[File:WT_Fig7.PNG|650px|center|]]

== Results ==
The authors tested on MNIST, CIFAR-10, SHVN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used and the Haar wavelet is used due to its even, square subbands. The network for all datasets except MNIST is loosedly based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results teach us that the pooling method should be chosen specific to the type of data we have. In some cases wavelet pooling may perform the best, and in other cases, other methods may perform better, if the data is more suited for those types of pooling.

=== MNIST ===
Figure 7 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy from all pooling methods compared. Figure 8 shows the energy of each method per epoch.

[[File:WT_Fig4.PNG|650px|center|]]

[[File:paper21_fig8.png|800px|center]]

[[File:WT_Tab1.PNG|650px|center|]]

=== CIFAR-10 ===
In order to investigate the performance of different pooling methods, two networks are trained based on CIFAR-10. The first one is the regular CNN and the second one is the network with dropout and batch normalization. Figure 9 shows the network and Tables 2 and 3 shows the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive.

[[File:WT_Fig5.PNG|650px|center|]]

[[File:paper21_fig10.png|800px|center]]

[[File:WT_Tab2.PNG|650px|center|]]

[[File:WT_Tab3.PNG|650px|center|]]

===SHVN===
Figure 11 shows the network and Tables 4 and 5 shows the accuracy without and with dropout. The proposed method does not perform well in this experiment.

[[File: a.png|650px|center|]]

[[File:paper21_fig12.png|800px|center]]

[[File: b.png|650px|center|]]

== Computational Complexity ==
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.

[[File:WT_Tab4.PNG|650px|center|]]

== Criticism ==
=== Positive ===
* Wavelet Pooling achieves competitive performance with standard go to pooling methods
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)
=== Negative ===
* Only 2x2 pooling window used for comparison
* Highly computationally extensive
* Not as simple as other pooling methods
* Only one wavelet used (HAAR wavelet)

== References ==
Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.

File:paper21 fig12.png

2018-03-16T04:48:53Z

Cs4li:

File:paper21 fig10.png

2018-03-16T04:48:44Z

Cs4li:

File:paper21 fig8.png

2018-03-16T04:44:49Z

Cs4li:

On The Convergence Of ADAM And Beyond

2018-03-16T04:27:43Z

Cs4li:

= Introduction =
Somewhat different to the presentation I gave in class, this paper focuses strictly on the pitfalls in convergance of the ADAM training algorithm for neural networks from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that it is possible for ADAM to get "stuck" in it's weighted average history, preventing it from converging to an optimal solution. An example is that in an experiment there may be a large spike in the gradient during some minibatches, but since ADAM weighs the current update by the exponential moving averages of squared past
gradients, the effect of the large spike in gradient is lost. This can be prevented through novel adjustments to the ADAM optimization algorithm, which can improve convergence.

== Notation ==
The paper presents the following framework that generalizes training algorithms to allow us to define a specific variant such as AMSGrad or SGD entirely within it:

[[File:training_algo_framework.png]]

Where we have <math> x_t </math> as our network parameters defined within a vector space <math> \mathcal{F} </math>. <math> \prod_{\mathcal{F}} (y) = </math> the projection of <math> y </math> on to the set <math> \mathcal{F} </math>.
<math> \psi_t </math> and <math> \phi_t </math> correspond to arbitrary functions we will provide later, The former maps from the history of gradients to <math> \mathbb{R}^d </math> and the latter maps from the history of the gradients to positive semi definite matrices. And finally <math> f_t </math> is our loss function at some time <math> t </math>, the rest should be pretty self explanatory. Using this framework and defining different <math> \psi_t </math> , <math> \phi_t </math> will allow us to recover all different kinds of training algorithms under this one roof.

=== SGD As An Example ===
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It's easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.

=== ADAM As Another Example ===
Once you can convince yourself that SGD is correct, you should understand the framework enough to see why the following setup for ADAM will allow us to recover the behavior we want. ADAM has the ability to define a "learning rate" for every parameter based on how much that parameter moves over time (a.k.a it's momentum) supposedly to help with the learning process.

In order to do this we will choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=0}^{t} {\beta_1}^{t - i} g_t </math>, psi to be <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=0}^{t} {\beta_2}^{t - i} {g_t}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. This set up is equivalent to choosing a learning rate decay of <math>\alpha / \sqrt{\sum_i g_{i,j}}</math> for <math>j \in [d]</math>.

From this we can now see that <math>m_t </math> gets filled up with the exponentially weighted average of the history of our gradients that we've come across so far in the algorithm. And that as we proceed to update we scale each one of our paramaters by dividing out <math> V_t </math> (in the case of diagonal it's just 1/the diagonal entry) which contains the exponentially weighted average of each parameters momentum (<math> {g_t}^2 </math>) across our training so far in the algorithm. Thus giving each parameter it's own unique scaling by it's second moment or momentum. Intuitively from a physical perspective if each parameter is a ball rolling around in the optimization landscape what we are now doing is instead of having the ball changed positions on the landscape at a fixed velocity (i.e. momentum of 0) the ball now has the ability to accelerate and speed up or slow down if it's on a steep hill or flat trough in the landscape (i.e. a momentum that can change with time).

= <math> \Gamma_t </math> An Interesting Quantity =
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:

<math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math>

Which essentially measure the change of the "Inverse of the learning rate" across time (since we are using alpha's as step sizes). Looking back to our example of SGD it's not hard to see that this quantity is strictly positive, which leads to "non-increasing" learning rates a desired property. However that is not the case with ADAM, and can pose a problem in a theoretical and applied setting. The problem ADAM can face is that <math> \Gamma_t </math> can be indefinite, which the original proof assumed it could not be. The math for this proof is VERY long so instead we will opt for an example to showcase why this could be an issue.

Consider the loss function <math> f_t(x) = \begin{cases}
Cx & \text{for } t \text{ mod 3} = 1 \\
-x & \text{otherwise}
\end{cases} </math>

Where we have <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math>
Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math>. We then proceed to plug this into our framework from before.
This function is periodic and it's easy to see that it has the gradient of C once and then a gradient of -1 twice every period. It has an optimal solution of <math> x = -1 </math> (from a regret standpoint), but using ADAM we would eventually converge at <math> x = 1 </math> since <math> C </math> would "overpower" the -1's.

We formalize this intuition in the results below.

'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.

'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.

'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution.

= AMSGrad as an improvement to ADAM =

There is a very simple intuitive fix to ADAM to handle this problem. We simply scale our historical weighted average by the maximum we have seen so far to avoid the negative sign problem. There is a very simple one liner adaptation of ADAM to get to AMSGRAD:
[[File:AMSGrad_algo.png]]

Below are some simple plots comparing ADAM and AMSGrad, the first are from the paper and the second are from another individual who attempted to recreate the experiments. The two plots somewhat disagree with one another so take this heuristic improvement with a grain of salt.
[[File:AMSGrad_vs_adam.png]]

Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge

[[File:AMSGrad_vs_adam3.png | 1000px]]

[[File:AMSGrad_vs_adam2.png]]

= Conclusion =
We have introduced a framework for which we could view several different training algorithms. From there we used it to recover SGD as well as ADAM. In our recovery of ADAM we investigated the change of the inverse of the learning rate over time to discover in certain cases there were convergence issues. We proposed a new heuristic AMSGrad to help deal with this problem and presented some empirical results that show it may have helped ADAM slightly. Thanks for your time.

= Source =
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018

stat946w18/Self Normalizing Neural Networks

2018-03-16T03:54:52Z

Cs4li:

==Introduction and Motivation==

While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].

The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such property, and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computations.

Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.

==Notations==

As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.

Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.

Assume that all <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>.

For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.

==Key Concepts==

===Self-Normalizing Neural-Net (SNN)===

''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''

In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).

We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.

The activation function that makes an SNN possible should meet the following four conditions:

# It can take on both negative and positive values, so it can normalize the mean;
# It has a saturation region, so it can dampen variances that are too large;
# It has a slope larger than one, so it can increase variances that are too small; and
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).

Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.

===Scaled Exponential Linear Units (SELUs)===

One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],

\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\
\alpha e^x - \alpha & x \leqslant 0
\end{cases} \]

but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:

\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\
\alpha e^x - \alpha & x \leqslant 0
\end{cases} \]

SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>.

Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.

[[File:snnf1.png|500px]]

Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.

[[File:snnf2.png|600px]]

=== Banach Fixed Point Theorem and Contraction Mappings ===

The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''

A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.

The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.

==Proving the Self-Normalizing Property==

===Mean and Variance Mapping Function===

<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),

\begin{align}
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega
\end{align}

\begin{align}
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}
\end{align}

When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>.

Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables:

[[File:gmapping.png|600px|center]]

Analytical solutions for the integrals can be obtained as follows:

[[File:gintegral.png|600px|center]]

The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.

At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.

===Details of Moment-Mapping Integrals ===
Consider the moment-mapping integrals:
\begin{align}
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.
\end{align}

The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as
\begin{align}
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\
\end{align}

The first integral can be simplified via the substituiton
\begin{align}
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).
\end{align}
While the second and third can be simplified via the substitution
\begin{align}
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).
\end{align}
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.

===Self-Normalizing Property Under Normalized Weights===

Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.

[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]

===Self-Normalizing Property Under Unnormalized Weights===

As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range.

Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.

The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.

====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====

'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.

'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied.

The two requirements are proven as follows:

'''1. g is a contraction mapping.'''

For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that:

\begin{align}
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2
\end{align}

As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.

To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.

For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).

From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>,

[[File:seq.png|600px|center]]

Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by

[[File:seq2.png|600px|center]]

An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).

The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields

\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]

Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore,

\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]

Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.

'''2. g does not map outside its domain.'''

To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>.

First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that

\begin{align}
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773
\end{align}

and

\begin{align}
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.
\end{align}

Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>,

\begin{align}
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617
\end{align}

The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.

==== Theorem 2: Decreasing Variance from Above ====

'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.

Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.

'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.

==== Theorem 3: Increasing Variance from Below ====

'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.

Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.

'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,

[[File:th3.png|700px|center]]

Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.

== Implementation Details ==

=== Initialization ===

As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.

=== Dropout Technique ===

The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.

With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.

The updated mean and variance of the activations are now:
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \]

and

\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]

Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.

Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.

=== Optimizers ===

Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.

==Experimental Results==

Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.

The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.

On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.

[[File:snnf3.png|600px]]

On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures, came close to the state-of-the-art results on the Tox21 dataset with an 8-layer network, and produced a new state-of-the-art AUC on predicting pulsars for the HTRU2 dataset by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).

On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.

Here are the results on the Tox21 challenge. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.

[[File:tox21.png|600px]]

==Future Work==

Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].

Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.

==Critique==

Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.

The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.

==Conclusion==

The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory.

When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built.

In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.

==References==

# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015).
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015).
# https://en.wikipedia.org/wiki/Central_limit_theorem
# http://mathworld.wolfram.com/SpectralNorm.html
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf

==Online Resources==
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)

==Footnotes==

1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem

\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \]

The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.

2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.

3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.

File:paper10 fig2.png

2018-03-16T03:26:14Z

Cs4li:

Word translation without parallel data

2018-03-09T20:59:48Z

Cs4li:

[[File:Toy_example.png]]

= Presented by =

Xia Fan

= Introduction =

Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.

This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data. This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters.

= Model =

=== Estimation of Word Representations in Vector Space ===

This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that

\begin{align}
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)
\end{align}

where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary.

Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, which advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :

\begin{align}
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T, with U\Sigma V^T=SVD(YX^T).
\end{align}

This can be proven as follows. First note that
\begin{align}
&||WX-Y||_F\\
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2,
\end{align}

where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have
\begin{align}
&\langle W X, Y \rangle_F \\
&= \text{Tr} (W^T Y X^T)\\
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\
&=\sum_i \sigma_i ((Wv_i)^T u_i )\\
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\\
&= \sum_i \sigma_i
\end{align}
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing
\begin{align}
W=UV^T
\end{align}
achieves the bound. This completes the proof.

=== Domain-adversarial setting ===

This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learn an initial proxy of W by using an adversarial criterion. Then, it use the words that match the best as anchor points for Procrustes. Finally, it improve performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region.

[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]

Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).

1. Discriminator objective

Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:

\begin{align}
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)
\end{align}

2. Mapping objective

In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins:

\begin{align}
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)
\end{align}

3. Learning algorithm
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math>

=== Refinement procedure ===

The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.

=== Cross-Domain Similarity Local Scaling (CSLS) ===

This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as

\begin{align}
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)
\end{align}

where cos(,) is the cosine similarity. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denotes as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words,as

\begin{align}
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)
\end{align}

This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas.

= Training and architectural choices =
=== Architecture ===

This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W .

=== Discriminator inputs ===
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.

=== Orthogonality===
In this work, it propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used :

\begin{align}
W \leftarrow (1+\beta)W-\beta(WW^T)W
\end{align}

where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.

This update rule can be justified as follows. Consider the function
\begin{align}
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}
\end{align}
defined by
\begin{align}
g(W)= W^T W -I.
\end{align}

The derivative of g at W is is the linear map
\begin{align}
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}
\end{align}
defined by
\begin{align}
Dg[W](H)= H^T W + W^T H.
\end{align}

The adjoint of this linear map is

\begin{align}
D^\ast g[W](H)= WH^T +WH.
\end{align}

Now consider the function f
\begin{align}
f: \mathbb{R}^{d \times d} \to \mathbb{R}
\end{align}

defined by

\begin{align}
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.
\end{align}

f has gradient:
\begin{align}
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.
\end{align}

Thus the update
\begin{align}
W \leftarrow (1+\beta)W-\beta(WW^T)W
\end{align}
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.

=== Dictionary generation ===
The refinement step requires to generate a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To increase even more the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.

=== Validation criterion for unsupervised model selection ===

This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. Figure 2 shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)

[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]

= Results =

In what follows, the results on word translation retrieval using the bilingual dictionaries were presented in Table 1 and the comparison to previous work in Table 2 where unsupervised model significantly outperform previous approaches. The results on the sentence translation retrieval task were also presented in Table 3 and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto was presented in Table 5.

[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]

[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]

[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]

[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]

[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]

[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]

= Conclusion =
This paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.

= Source =
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio
| Unsupervised Machine Translation Using Monolingual Corpora Only
| arVix: 1701.04087

File:paper9 fig3.png

2018-03-09T20:32:09Z

Cs4li:

stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers

2018-03-08T23:59:13Z

Cs4li:

== Introduction ==

With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in memory and CPU-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations per second (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. Clearly, it would be difficult to deploy and run these models on low-power devices.

In general, model compression can be accomplished using four main non-exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-exclusive, we mean that these methods can be used in combination for pruning a single model; the use of one method does not exclude any of the other methods from being viable.

Ye et al. (2018) explores pruning entire channels in a convolutional neural network. Past work has mostly focused on norm or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is, "mathematically appealing from an optimization perspective and easy to reproduce" (Ye et al., 2018). In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical findings.

== Motivation ==

Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:

$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$

where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.

Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.

<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center>

Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.

Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.

== Method ==

At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.

=== Summary ===

The basic algorithm can be summarized as follows:

1. Penalize the L1-norm of the batch normalization scaling parameters in the loss

2. Train until loss plateaus

3. Remove channels that correspond to a downstream zero in batch normalization

4. Fine-tune the pruned model using regular learning

=== Details ===

There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.

==== Slow Convergence ====
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$

Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.

==== Penalty Normalization ====

In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.

[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)

To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.

=== Steps ===

The final algorithm can be summarized as follows:

1. Compute the per-layer normalized sparse penalty constant ''lambda''

2. Compute the global LASSO loss with global scaling constant ''rho''

3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.

4. Remove channels that correspond to a downstream zero in batch normalization

5. Fine-tune the pruned model using regular learning

== Results ==

The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.

Results on CIFAR-10:

[[File:Screenshot_from_2018-02-28_17-24-25.png]]

Results on ILSVRC2012:

[[File:Screenshot_from_2018-02-28_17-24-36.png]]

== Conclusion ==

Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.

It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.

In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.

== References ==

* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149

stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers

2018-03-08T23:58:59Z

Cs4li:

== Introduction ==

With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in memory and CPU-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations per second (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. Clearly, it would be difficult to deploy and run these models on low-power devices.

In general, model compression can be accomplished using four main non-exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-exclusive, we mean that these methods can be used in combination for pruning a single model; the use of one method does not exclude any of the other methods from being viable.

Ye et al. (2018) explores pruning entire channels in a convolutional neural network. Past work has mostly focused on norm or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is, "mathematically appealing from an optimization perspective and easy to reproduce" (Ye et al., 2018). In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical findings.

== Motivation ==

Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:

$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$

where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.

Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.

<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math><center/>

Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.

Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.

== Method ==

At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.

=== Summary ===

The basic algorithm can be summarized as follows:

1. Penalize the L1-norm of the batch normalization scaling parameters in the loss

2. Train until loss plateaus

3. Remove channels that correspond to a downstream zero in batch normalization

4. Fine-tune the pruned model using regular learning

=== Details ===

There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.

==== Slow Convergence ====
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$

Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.

==== Penalty Normalization ====

In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.

[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)

To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.

=== Steps ===

The final algorithm can be summarized as follows:

1. Compute the per-layer normalized sparse penalty constant ''lambda''

2. Compute the global LASSO loss with global scaling constant ''rho''

3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.

4. Remove channels that correspond to a downstream zero in batch normalization

5. Fine-tune the pruned model using regular learning

== Results ==

The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.

Results on CIFAR-10:

[[File:Screenshot_from_2018-02-28_17-24-25.png]]

Results on ILSVRC2012:

[[File:Screenshot_from_2018-02-28_17-24-36.png]]

== Conclusion ==

Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.

It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.

In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.

== References ==

* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149

stat946w18/AmbientGAN: Generative Models from Lossy Measurements

2018-03-08T23:08:25Z

Cs4li:

= Introduction =
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often times we only have access to noisy or partial observations, which will from here on be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:

<gallery mode="packed">
File:mnist.png| True Data (Unobserved)
File:mnistmeasured.png| Measured Data (Observed)
</gallery>

=== Ignore the problem ===
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]

Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred.

=== Try to recover the information lost ===
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]

Works better than ignoring the problem but depends on how easily the measurement function can be inverted.

=== AmbientGAN ===
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]

Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data.

AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements.

= Model =
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.

Thus <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\theta \sim p_\theta</math>, we can only observe <math>y = f_\theta(x)</math>.

Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample.

If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>.

Unfortunately we do not observe any <math>X^g \sim p_x</math> so we can use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.

AmbientGAN has the objective function:

<math>\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))] </math>

where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.

As a technical limitation we require <math>f_\theta</math> to be differentiable with the respect each input for all values of <math>\theta</math>.

With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.

= Empirical Results =

The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline.

=== Convolve + Noise ===
Additional results with the convolve + noise case with the celebA dataset, with the AmbientGAN compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created from <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operatorm, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.

[[File:paper7_fig3.png]]

Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).

=== Block-Pixels ===
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.

[[File:block-pixels.png]]

Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).

=== Block-Patch ===

[[File:block-patch.png]]

A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right).

=== Pad-Rotate-Project-<math>\theta</math> ===

[[File:pad-rotate-project-theta.png]]

Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known.

The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is.

=== Explanation of Inception Score ===
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.

=== MNIST Inception ===

[[File:MNIST-inception.png]]

AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.

The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.

=== CIFAR-10 Inception ===

[[File:CIFAR-inception.png]]

AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10.

= Theoretical Results =

The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A.

=== Lemma 5.1 ===
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution.

Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>.

Then for the standard GAN model if the Discriminator is optimal, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>.

=== Theorems 5.2===
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.

=== Theorems 5.3===
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.

=== Theorems 5.4===
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of
\begin{equation}
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)
\end{equation}
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.

= Future Research =

One critical weakness of AmbientGAN is the assumption that the measurement model is known. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data.

A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.

= References =
# https://openreview.net/forum?id=Hy7fDog0b
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

stat946w18/AmbientGAN: Generative Models from Lossy Measurements

2018-03-08T23:07:59Z

Cs4li:

= Introduction =
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often times we only have access to noisy or partial observations, which will from here on be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:

<gallery mode="packed">
File:mnist.png| True Data (Unobserved)
File:mnistmeasured.png| Measured Data (Observed)
</gallery>

=== Ignore the problem ===
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]

Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred.

=== Try to recover the information lost ===
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]

Works better than ignoring the problem but depends on how easily the measurement function can be inverted.

=== AmbientGAN ===
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]

Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data.

AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements.

= Model =
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.

Thus <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\theta \sim p_\theta</math>, we can only observe <math>y = f_\theta(x)</math>.

Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample.

If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>.

Unfortunately we do not observe any <math>X^g \sim p_x</math> so we can use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.

AmbientGAN has the objective function:

<math>\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))] </math>

where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.

As a technical limitation we require <math>f_\theta</math> to be differentiable with the respect each input for all values of <math>\theta</math>.

With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.

= Empirical Results =

The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline.

=== Convolve + Noise ===
Additional results with the convolve + noise case, with the AmbientGAN compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created from <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operatorm, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.

[[File:paper7_fig3.png]]

Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).

=== Block-Pixels ===
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.

[[File:block-pixels.png]]

Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).

=== Block-Patch ===

[[File:block-patch.png]]

A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right).

=== Pad-Rotate-Project-<math>\theta</math> ===

[[File:pad-rotate-project-theta.png]]

Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known.

The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is.

=== Explanation of Inception Score ===
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.

=== MNIST Inception ===

[[File:MNIST-inception.png]]

AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.

The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.

=== CIFAR-10 Inception ===

[[File:CIFAR-inception.png]]

AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10.

= Theoretical Results =

The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A.

=== Lemma 5.1 ===
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution.

Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>.

Then for the standard GAN model if the Discriminator is optimal, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>.

=== Theorems 5.2===
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.

=== Theorems 5.3===
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.

=== Theorems 5.4===
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of
\begin{equation}
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)
\end{equation}
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.

= Future Research =

One critical weakness of AmbientGAN is the assumption that the measurement model is known. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data.

A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.

= References =
# https://openreview.net/forum?id=Hy7fDog0b
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

File:paper7 fig3.png

2018-03-08T22:52:30Z

Cs4li:

stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only

2018-03-08T22:21:47Z

Cs4li:

[[File:MC_Translation_Example.png]]
== Introduction ==
Neural machine translation systems must be trained on large corpora consisting of pairs of pre-translated sentences. The paper ''Unsupervised Machine Translation Using Monolingual Corpora Only'' by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system, which can be trained without such parallel data.

==Motivation==
The authors offer two motivations for their work:
# To translate between languages for which large parallel corpora do not exist.
# To provide a strong baseline against which translation systems using parallel corpora can be compared.
== Overview of unsupervised translation system ==
The unsupervised translation scheme has the following outline:
* The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.
* Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.
* A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.
* An adversarial loss encourages the latent-space representations of source and target sentences to be indistinguishable from each other. It is intended that the latent-space representation of a sentence should reflect its meaning, and not the particular language in which it is expressed.
* A reconstruction loss encourages the model to improve on the translation model of the previous epoch.
[[File:paper4_fig1.png|frame|none|alt=Alt text|A toy example of illustrating the training process which guides the design of the objective function. The key idea here is to build a common latent space between languages. On the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language. On the right, the model is trained to reconstruct a sentence given the same sentence but in another language.]]
==Notation==
Let <math display="inline">S</math> denote the set of words in the source language, and let <math display="inline">T</math> denote the set of words in the target language. Let <math display="inline">H \subset \mathbb{R}^{n_H}</math> denote the latent vector space. Moreover, let <math display="inline">S'</math> and <math display="inline">T'</math> denote the sets of finite sequences of words in the source and target language, and let <math display="inline">H'</math> denote the set of finite sequences of vectors in the latent space. For any set X, elide measure-theoretic details and let <math display="inline">\mathcal{P}(X)</math> denote the set of probability distributions over X.

==Word vector alignment ==

Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps words with related meanings to nearby vectors, regardless of the language of the words. Moreover, if two words are one another's literal translations, their word vectors tend to be mutual nearest neighbors.

The underlying idea of the alignment scheme can be summarized as follows: methods like word2vec or GLoVe generate vectors for which there is a correspondence between semantics and geometry. If <math display="inline">f</math> maps English words to their corresponding vectors, we have the approximate equation
\begin{align}
f(\text{king}) -f(\text{man}) +f(\text{woman})\approx f(\text{queen}).
\end{align}
Furthermore, if <math display="inline">g</math> maps French words to their corresponding vectors, then
\begin{align}
g(\text{roi}) -g(\text{homme}) +g(\text{femme})\approx g(\text{reine}).
\end{align}

Thus if <math display="inline">W</math> maps the word vectors of English words to the word vectors of their French translations, we should expect <math display="inline">W</math> to be linear. As was observed by Mikolov et al. (2013), the problem of word-vector alignment then becomes a problem of learning the linear transformation that best aligns two point clouds, one from the source language and one from the target language. For more on the history of the word-vector alignment problem, see my CS698 project ([https://uwaterloo.ca/scholar/sites/ca.scholar/files/pa2forsy/files/project_dec_3_0.pdf link]).

Conneau et al. (2017)'s word vector alignment scheme is unique in that it requires no parallel data, and uses only the shapes of the two word-vector point clouds to be aligned. I will not go into detail, but the heart of the method is a special GAN, in which only the discriminator is a neural network, and the generator is the map corresponding to an orthogonal matrix.

This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on we denote by
<math display="inline">A: S' \cup T' \to \mathcal{Z}'</math> the function that maps a source- or target- language word sequence to the corresponding aligned word vector sequence.

==Encoder ==
The encoder <math display="inline">E </math> reads a sequence of word vectors <math display="inline">(z_1,\ldots, z_m) \in \mathcal{Z}'</math> and outputs a sequence of hidden states <math display="inline">(h_1,\ldots, h_m) \in H'</math> in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence <math display="inline">x=(x_1,\ldots, x_M)\in S'</math> to the latent space, we compute <math display="inline">E(A(x))</math>, and to map a target sentence <math display="inline">y=(y_1,\ldots, y_K)\in T'</math> to the latent space, we compute <math display="inline">E(A(y))</math>.

The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTMs at each word vector.

==Decoder==

The decoder is a mono-directional LSTM that accepts a sequence of hidden states <math display="inline">h=(h_1,\ldots, h_m) \in H'</math> from the latent space and a language <math display="inline">L \in \{S,T \}</math> and outputs a probability distribution over sentences in that language. We have

\begin{align}
D: H' \times \{S,T \} \to \mathcal{P}(S') \cup \mathcal{P}(T').
\end{align}

The decoder makes use of the attention mechanism of Bahdanau et al. (2014). To compute the probability of a given sentence <math display="inline">y=(y_1,\ldots,y_K)</math> , the LSTM processes the sentence one word at a time, accepting at step <math display="inline">k</math> the aligned word vector of the previous word in the sentence <math display="inline">A(y_{k-1})</math> and a context vector <math display="inline">c_k\in H</math> computed from the hidden sequence <math display="inline">h\in H'</math>, and outputting a probability distribution over possible next words. The LSTM is initiated with a special, language-specific start-of-sequence token. Otherwise, the decoder is does not depend on the language of the sentence it is producing. The context vector is computed as described by Bahdanau et al. (2014), where we let <math display="inline">l_{k}</math> denote the hidden state of the LSTM at step <math display="inline">k</math>, and where <math display="inline">U,W</math> are learnable weight matrices, and <math display="inline">v</math> is a learnable weight vector:
\begin{align}
c_k&= \sum_{m=1}^M \alpha_{k,m} h_m\\
\alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\
e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ).
\end{align}

By learning <math display="inline">U,W</math> and <math display="inline">v</math>, the decoder can learn to decide which vectors in the sequence <math display="inline">h</math> are relevant to computing which words in the output sentence.

At step <math display="inline">k</math>, after receiving the context vector <math display="inline">c_k\in H</math> and the aligned word vector of the previous word in the sequence,<math display="inline">A(y_{k-1})</math>, the LSTM outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.

==Overview of objective ==
The objective function is the sum of:
# The de-noising auto-encoder loss,
# The translation loss,
# The adversarial loss.
I shall describe these in the following sections.

==De-noising Auto-encoder Loss ==
A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset to the original un-corrupted sample. De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating. If we think of the dataset of interest as a thin manifold in a high-dimensional space, the corruption process is likely perturbed a datapoint off the manifold. To learn to restore the corrupted datapoint, the de-noising auto-encoder must learn the shape of the manifold.

Hill et al. (2016), used a de-noising auto-encoder to learn vectors representing sentences. They corrupted input sentences by randomly dropping and swapping words, and then trained a neural network to map the corrupted sentence to a vector, and then map the vector to the un-corrupted sentence. Interestingly, they found that sentence vectors learned this way were particularly effective when applied to tasks that involved generating paraphrases. This makes some sense: for a vector to be useful in restoring a corrupted sentence, it must capture something of the sentence's underlying meaning.

The present paper uses the principal of de-noising auto-encoders to compute one of the terms in its loss function. In each iteration, a sentence is sampled from the source or target language, and a corruption process <math display="inline"> C</math> is applied to it. <math display="inline"> C</math> works by deleting each word in the sentence with probability <math display="inline">p_C</math> and applying to the sentence a permutation randomly selected from those that do not move words more than <math display="inline">k_C</math> spots from their original positions. The authors select <math display="inline">p_C=0.1</math> and <math display="inline">k_C=3</math>. The corrupted sentence is then mapped to the latent space using <math display="inline">E\circ A</math>. The loss is then the negative log probability of the original un-corrupted sentence according to the decoder <math display="inline">D</math> applied to the latent-space sequence.

The explanation of Vincent et al. (2008) can help us understand this loss-function term: the de-noising auto-encoder loss forces the translation system to learn the shapes of the manifolds of the source and target languages.

==Translation Loss==
To compute the translation loss, we sample a sentence from one of the languages, translate it with the encoder and decoder of the previous epoch, and then corrupt its output with <math display="inline">C</math>. We then use the current encoder <math display="inline">E</math> to map the corrupted translation to a sequence <math display="inline">h \in H'</math> and the decoder <math display="inline">D</math> to map <math display="inline">h</math> to a probability distribution over sentences. The translation loss is the negative log probability the decoder assigns to the original uncorrupted sentence.

It is interesting and useful to consider why this translation loss, which depends on the translation model of the previous iteration, should promote an improved translation model in the current iteration. One loose way to understand this is to think of the translator as a de-noising translator. We are given a sentence perturbed from the manifold of possible sentences from a given language both by the corruption process and by the poor quality of the translation. The model must learn to both project and translate. The technique employed here resembles that used by Sennrich et al. (2014), who trained a neural machine translation system using both parallel and monolingual data. To make use of the monolingual target-language data, they used an auxiliary model to translate it to the source language, then trained their model to reconstruct the original target-language data from the source-language translation. Sennrich et al. argued that training the model to reconstruct true data from synthetic data was more robust than the opposite approach. The authors of the present paper use similar reasoning.

==Adversarial Loss ==
The intuition underlying the latent space is that it should encode the meaning of a sentence in a language-independent way. Accordingly, the authors introduce an adversarial loss, to encourage latent-space vectors mapped from the source and target languages to be indistinguishable. Central to this adversarial loss is the discriminator <math display="inline">R:H' \to [0,1]</math>, which makes use of <math display="inline">r: H\to [0,1]</math> a three-layer fully-connected neural network with 1024 hidden units per layer. Given a sequence of latent-space vectors <math display="inline">h=(h_1,\ldots,h_m)\in H'</math> the discriminator assigns probability <math display="inline">R(h)=\prod_{i=1}^m r(h_i)</math> that they originated in the target space. Each iteration, the discriminator is trained to maximize the objective function

\begin{align}
I_T(q) \log (R(E(q))) +(1-I_T(q) )\log(1-R(E(q)))
\end{align}

where <math display="inline">q</math> is a randomly selected sentence, and <math display="inline">I_T(q)</math> is 1 when <math display="inline">q\in I_T</math> is from the source language and 0 if <math display="inline">q\in I_S</math>

The same term is added to the primary objective function, which the encoder and decoder are trained to minimize. The result is that the encoder and decoder learn to fool the discriminator by mapping sentences from the source and target language to similar sequences of latent-space vectors.

The authors note that they make use of label smoothing, a technique recommended by Goodfellow (2016) for regularizing GANs, in which the objective described above is replaced by

\begin{align}
I_T(q)( (1-\alpha)\log (R(E(q))) +\alpha\log(1-R(E(q))) )+(1-I_T(q) ) ( (1-\beta) \log(1-R(E(q))) +\beta\log (R(E(q)) ))
\end{align}
for some small nonnegative values of <math display="inline">\alpha, \beta</math>, the idea being to prevent the discriminator from making extreme predictions. While one-sided label smoothing (<math display="inline">\beta = 0</math>) is generally recommended, the present model differs from a standard GAN in that it is symmetric, and hence two-sided label smoothing would appear more reasonable.

It is interesting to observe that while the intuition justifying the use of the latent space suggests that the latent space representation of a sentence should be language-independent, this is not actually true: if two sentences are translations of one another, but have different lengths, their latent-space representations will necessarily be different, since a a sentence's latent space representation has the same length as the sentence itself.

==Objective Function==

Combining the above-described terms, we can write the overall objective function. Let <math display="inline">Q_S</math> denote the monolingual dataset for the source language, and let <math display="inline">Q_T</math> denote the monolingual dataset for the target language. Let <math display="inline">D_S:= D(\cdot, S)</math> and<math display="inline">D_T= D(\cdot, T)</math> (i.e. <math display="inline">D_S, D_T</math>) be the decoder restricted to the source or target language, respectively. Let <math display="inline"> M_S </math> and <math display="inline"> M_T </math> denote the target-to-source and source-to-target translation models of the previous epoch. Then our objective function is

\begin{align}
\mathcal{L}(D,E,R)=\text{T Translation Loss}+\text{T De-noising Loss} +\text{T Adversarial Loss} +\text{S Translation Loss} +\text{S De-noising Loss} +\text{S Adversarial Loss}\\
\end{align}
\begin{align}
=\sum_{q\in Q_T}\left( -\log D_T \circ E \circ C \circ M _S(q) (q) -\log D_T \circ E \circ C (q) (q)+(1-\alpha)\log (R\circ E(q)) +\alpha\log(1-R\circ E(q)) \right)+\sum_{q\in Q_S}\left( -\log D_S \circ E \circ C \circ M_T (q) (q) -\log D_S \circ E \circ C (q) (q)+(1-\beta) \log(1-R \circ E(q)) +\beta\log (R\circ E(q) \right).
\end{align}

They alternate between iterations minimizing <math display="inline">\mathcal{L} </math> with respect to <math display="inline">E, D</math> and iterations maximizing with respect to <math display="inline">R</math>. ADAM is used for minimization, while RMSprop is used for maximization. After each epoch, M is updated so that <math display="inline">M_S=D_S \circ E</math> and <math display="inline">M_T=D_T \circ E</math>, after which <math display="inline"> M </math> is frozen until the next epoch.

==Validation==
The authors' aim is for their method to be completely unsupervised, so they do not use parallel corpora even for the selection of hyper-parameters. Instead, they validate by translating sentences to the other language and back, and comparing the resulting sentence with the original according to BLEU, a similarity metric frequently used in translation (Papineni et al. 2002).

==Experimental Procedure and Results==

The authors test their method on four data sets. The first is from the English-French translation task of the Workshop on Machine Translation 2014 (WMT14). This data set consists of parallel data. The authors generate a monolingual English corpus by randomly sampling 15 million sentence pairs, and choosing only the English sentences. They then generate a French corpus by selecting the French sentences from those pairs that were not previous chosen. Importantly, this means that the monolingual data sets have no parallel sentences. The second data set is generated from the English-German translation task from WMT14 using the same procedure.

The third and fourth data sets are generated from Multi30k data set, which consists of multilingual captions of various images. The images are discarded and the English, French, and German captions are used to generate monolingual data sets in the manner described above. These monolingual corpora are much smaller, consisting of 14500 sentences each.

The unsupervised translation scheme performs well, though not as well as a supervised translation scheme. It converges after a small number of epochs. Besides supervised translation, the authors compare their method with three other baselines: "Word-by-Word" uses only the previously-discussed word-alignment scheme; "Word-Reordering" uses a simple LSTM based language model and a greedy algorithm to select a reordering of the words produced by "Word-by-Word". "Oracle Word Reordering" means the optimal reordering of the words produced by "Word-by-Word".

==Result Figures==
[[File:MC_Translation Results.png]]
[[File:MC_Translation_Convergence.png]]

==Commentary==
This paper's results are impressive: that it is even possible to translate between languages without parallel data suggests that languages are more similar than we might initially suspect, and that the method the authors present has, at least in part, discovered some common deep structure. As the authors point out, using no parallel data at all, their method is able to produce results comparable to those produced by neural machine translation methods trained on hundreds of thousands of a parallel sentences on the WMT dataset. On the other hand, the results they offer come with a few significant caveats.

The first caveat is that the workhorse of the method is the unsupervised word-vector alignment scheme presented in Conneau et al. (2017) (that paper shares 3 authors with this one). As the ablation study reveals, without word-vector alignment, this method preforms extremely poorly. Moreover, word-by-word translation using word-vector alignment alone performs well, albeit not as well as this method. This suggests that the method of this paper mainly learns to perform (sometimes significant) corrections to word-by-word translations by reordering and occasional word substitution. Presumably, it does this by learning something of the natural structure of sentences in each of the two languages, so that it can correct the errors made by word-by-word translation.

The second caveat is that the best results are attained translating between English and French, two very closely related languages, and the quality of translation between English and German, a slightly-less related pair, is significantly worse ( according to the ''Shorter Oxford English Dictionary'', 28.3 percent of the English vocabulary is French-derived, 28.2 percent is Latin-derived, and 25 percent is derived from Germanic languages. This probably understates the degree of correspondence between the French and English vocabularies, since French likely derives from Latin many of the same words English does. ). The authors do not report results with more distantly-related pairs, but it is reasonable to expect that performance would degrade significantly, for two reasons. Firstly, Conneau et al. (2017) shows that the word-alignment scheme performs much worse on more distant language pairs. This may be because there are more one-to-one correspondences between the words of closely related languages than there are between more distant languages. Secondly, because the same encoder is used to read sentences of both languages, the encoder cannot adapt to the unique word-order properties of either language. This would become a problem for language pairs with very different grammar. The authors suggest that their scheme could be a useful tool for translating between language pairs for which their are few parallel corpora. However, language pairs lacking parallel corpora are often (though not always) distantly related, and it is for such pairs that the performance of the present method likely suffers.

The proposed method always beats Oracle Word Reordering on the Multi30k data set, but sometimes does not on the WMT data set. This may be because the WMT sentences are much more syntactically complex than the simple image captions of the Multi30k data set.

The ablation study also reveals the importance of the corruption process <math display="inline">C</math>: the absence of <math display="inline">C</math> significantly degrades translation quality, though not as much as the absence of word-vector alignment. We can understand this in two related ways. First of all, if we view the model as learning to correct structural errors in word-by-word translations, then the corruption process introduces more errors of this kind, and so provides additional data upon which the model can train. Second, as Vincent et al. (2008) point out, de-noising auto-encoder training encourages a model to learn the structure of the manifold from which the data is drawn. By learning the structure of the source and target languages, the model can better correct the errors of word-by-word translation.

[[File:MC_Alignment_Results.png|frame|none|alt=Alt text|From Conneau et al. (2017). The final row shows the performance of alignment method used in the present paper. Note the degradation in performance for more distant languages.]]

[[File:MC_Translation_Ablation.png|frame|none|alt=Alt text|From the present paper. Results of an ablation study. Of note are the first, third, and forth rows, which demonstrate that while the translation component of the loss is relatively unimportant, the word vector alignment scheme and de-noising auto-encoder matter a great deal.]]

==Future Work==
The principal of performing unsupervised translation by starting with a rough but reasonable guess, and then improving it using knowledge of the structure of target language seems promising. Word by word translation using word-vector alignment works well for closely related languages like English and French, but is unlikely to work as well for more distant languages. For those languages, a better method for getting an initial guess is required.

==References==
#Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
#Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. "Word Translation without Parallel Data". arXiv:1710.04087, (2017)
# Dictionary, Shorter Oxford English. "Shorter Oxford english dictionary." (2007).
#Goodfellow, Ian. "NIPS 2016 tutorial: Generative adversarial networks." arXiv preprint arXiv:1701.00160 (2016).
# Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).
# Lample, Guillaume, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised Machine Translation Using Monolingual Corpora Only." arXiv preprint arXiv:1711.00043 (2017).
#Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.
# Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168. (2013).
#Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." arXiv preprint arXiv:1511.06709 (2015).
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
# Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.

File:paper4 fig1.png

2018-03-08T22:09:33Z

Cs4li:

Understanding Image Motion with Group Representations

2018-03-08T22:08:05Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png|650px|center|]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this. To be more specific, the latent structure of a scene from rigid image motion could be modelled by a point cloud with a motion space <math>M=SE(3)</math>, where rigid image motion can be produced by a camera translating and rotating through a rigid scene in 3D. When a scene has N rigid bodies, the motion space can be represented as <math>M=[SE(3)]^N</math>.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

[[File:paper13_fig2.png|650px|center|]]

Overall, training with image pairs resulted in lower error than training with just single images. This is demonstrated in the below table.

[[File:table.png|700px|center|]]

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used Torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* Dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* Sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* Visualized the representation using t-SNE
** Clear clustering by translation and rotation but not object classes
** Suggests the representation captures the motion properties in the dataset, but is independent of image contents
* Visualized the image-conditioned saliency maps
** Take derivative of the network output respect to the map
** The area that has the highest gradient means that part contributes the most to the output
** The resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** Suggests the network is learn the right motion structure

[[File:paper13_fig3.png|700px|center|]]

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** Shows the method is able to capture dominant motion structure
* Test performance on interpolation task
** Check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** Test how sensitive the network is to deviations from unnatural motion
** High errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
** In order to do this, the distance between the embeddings of the frame sequences of the first and last frame <math>R([I_1,I_T])</math> and of the first, middle, and last frame <math>R([I_1, I_m, I_T])</math> is computed. This distance is compared with the distance when the middle frame of the second embedding is changed to a frame that is visually similar (inside sequence): <math>R([I_1, I_{IN}, I_T])</math> and one that is visually dissimilar (outside sequence): <math>R([I_1, I_{OUT}, I_T])</math>. The results are shown in Table 3. The embedding distance method is compared to the euclidean distance which is defined as the mean pixel distance between the test frame and <math>{I_1,I_T}</math>, whichever is smaller. It can be seen from the results that the embedding distance of the true frame is significantly lower than other frames. This means that the embedding distance used in the network is more sensitive to any atypical motions of the scenes.
* Visualized saliency maps
** Highs objects moving in the background, and motion of the car in the foreground
** Suggests the method can be used for tracking as well

[[File:paper13_tab2.png|700px|center|]]

[[File:paper13_fig4.png|700px|center|]]

[[File:paper13_fig5.png|700px|center|]]

[[File:table3_motion.PNG|700px|center|]]

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

== References ==
Jaegle, A. (2018). Understanding image motion with group representations . ICLR. Retrieved from https://openreview.net/pdf?id=SJLlmG-AZ.

Understanding Image Motion with Group Representations

2018-03-06T04:30:13Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png|500px]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

[[File:paper13_fig2.png|500px]]

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* visualized the representation using t-SNE
** clear clustering by translation and rotation but not object classes
** suggests the representation captures the motion properties in the dataset, but is independent of image contents
* visualized the image-conditioned saliency maps
** take derivative of the network output respect to the map
** the area that has the highest gradient means that part contributes the most to the output
** the resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** suggests the network is learn the right motion structure

[[File:paper13_fig3.png|500px]]

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** shows the method is able to capture dominant motion structure
* test performance on interpolation task
** check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** test how sensitive the network is to deviations from unnatural motion
** high errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
* visualized saliency maps
** highs objects moving in the background, and motion of the car in the foreground
** suggests the method can be used for tracking as well

[[File:paper13_tab2.png|500px]]

[[File:paper13_fig4.png|500px]]

[[File:paper13_fig5.png|500px]]

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

Understanding Image Motion with Group Representations

2018-03-06T04:29:42Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png|500px]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

[[File:paper13_fig2.png|500px]

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* visualized the representation using t-SNE
** clear clustering by translation and rotation but not object classes
** suggests the representation captures the motion properties in the dataset, but is independent of image contents
* visualized the image-conditioned saliency maps
** take derivative of the network output respect to the map
** the area that has the highest gradient means that part contributes the most to the output
** the resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** suggests the network is learn the right motion structure

[[File:paper13_fig3.png|500px]

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** shows the method is able to capture dominant motion structure
* test performance on interpolation task
** check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** test how sensitive the network is to deviations from unnatural motion
** high errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
* visualized saliency maps
** highs objects moving in the background, and motion of the car in the foreground
** suggests the method can be used for tracking as well

[[File:paper13_tab2.png|500px]

[[File:paper13_fig4.png|500px]

[[File:paper13_fig5.png|500px]

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

Understanding Image Motion with Group Representations

2018-03-06T04:28:08Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png|500px]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* visualized the representation using t-SNE
** clear clustering by translation and rotation but not object classes
** suggests the representation captures the motion properties in the dataset, but is independent of image contents
* visualized the image-conditioned saliency maps
** take derivative of the network output respect to the map
** the area that has the highest gradient means that part contributes the most to the output
** the resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** suggests the network is learn the right motion structure

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** shows the method is able to capture dominant motion structure
* test performance on interpolation task
** check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** test how sensitive the network is to deviations from unnatural motion
** high errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
* visualized saliency maps
** highs objects moving in the background, and motion of the car in the foreground
** suggests the method can be used for tracking as well

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

Understanding Image Motion with Group Representations

2018-03-06T04:28:01Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png|200px]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* visualized the representation using t-SNE
** clear clustering by translation and rotation but not object classes
** suggests the representation captures the motion properties in the dataset, but is independent of image contents
* visualized the image-conditioned saliency maps
** take derivative of the network output respect to the map
** the area that has the highest gradient means that part contributes the most to the output
** the resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** suggests the network is learn the right motion structure

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** shows the method is able to capture dominant motion structure
* test performance on interpolation task
** check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** test how sensitive the network is to deviations from unnatural motion
** high errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
* visualized saliency maps
** highs objects moving in the background, and motion of the car in the foreground
** suggests the method can be used for tracking as well

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

Understanding Image Motion with Group Representations

2018-03-06T04:27:44Z

Cs4li:

== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constraints learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.

[[File:paper13_fig1.png]]

== Related Work ==
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represents poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. Another approache to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding to image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.

== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition forms a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene.The method is designed to capture group associativity and invertibility.

Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this.

=== Learning Motion by Group Properties ===
The goal is to learn function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed <math> t_0 < t_1 < t_2 ... </math>
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
A embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing same the same final motion with different transition to the same representation. Invertibility is encouraged by pushing the same motion with same transition but in opposite direction away from each other, as well as push loops to the same representation. Uniqueness of identity is encouraged by pushing loops away from non-identity representations. Loops are also pushed to the same representation (identity) from different sequences.

These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured.

Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing contents, scene cuts, or long-term occlusions.

=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images are intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce and embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timstep is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences. The cost function is:

<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>

where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed margin selected to be 0.5. Positive pair are training example where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.

Each training sequence is composed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scale. Training the network with only one input images per timestep is also tried, but consistently yielded work results than image pairs.

== Experimentation ==
Trained network using rotated and translated MNIST dataset as well as KITTI dataset.
* Used torch
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach
* 50-60 batch size for MNIST, 25-30 batch size for KITTI
* dilated convolution with Relu and batch normalization
* Two LSTM cell per layer 256 hidden units each
* sequence length of 3-5 images

=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations
* visualized the representation using t-SNE
** clear clustering by translation and rotation but not object classes
** suggests the representation captures the motion properties in the dataset, but is independent of image contents
* visualized the image-conditioned saliency maps
** take derivative of the network output respect to the map
** the area that has the highest gradient means that part contributes the most to the output
** the resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing
** suggests the network is learn the right motion structure

=== Real World Motion in 3D ===
* Uses KITTI dataset collected on a car driving through roads in Germany
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion
** The data shows it performs not as well as the supervised algorithm, but consistent better than chance (guessing the mean value)
** shows the method is able to capture dominant motion structure
* test performance on interpolation task
** check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math>
** test how sensitive the network is to deviations from unnatural motion
** high errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion
* visualized saliency maps
** highs objects moving in the background, and motion of the car in the foreground
** suggests the method can be used for tracking as well

== Conclusion ==
The author presented a new model of motion and method for learning motion representations. It is shown that enforcing group properties can learn motion representations that is able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to useful for large variety of tasks.

== Criticism ==
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results again compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.

File:paper13 tab2.png

2018-03-06T04:26:16Z

Cs4li:

File:paper13 fig5.png

2018-03-06T04:26:06Z

Cs4li:

File:paper13 fig4.png

2018-03-06T04:25:56Z

Cs4li:

File:paper13 fig3.png

2018-03-06T04:25:48Z

Cs4li:

File:paper13 fig2.png

2018-03-06T04:25:39Z

Cs4li: