http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Tdunn&feedformat=atomstatwiki - User contributions [US]2024-03-29T06:21:48ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dynamic_Routing_Between_Capsules_STAT946&diff=36293Dynamic Routing Between Capsules STAT9462018-04-11T23:58:03Z<p>Tdunn: /* Weakness of Capsule Network */</p>
<hr />
<div>= Presented by =<br />
<br />
Yang, Tong(Richard)<br />
<br />
= Contributions =<br />
<br />
This paper introduces the concept of "capsules" and an approach to implement this concept in neural networks. Capsules are groups of neurons used to represent various properties of an entity/object present in the image, such as pose, deformation, and even the existence of the entity. Instead of the obvious representation of a logistic unit for the probability of existence, the paper explores using the length of the capsule output vector to represent existence, and the orientation to represent other properties of the entity. The paper makes the following major contributions:<br />
<br />
* Proposes an alternative to max-pooling called routing-by-agreement.<br />
* Demonstrates a mathematical structure for capsule layers and a routing mechanism. Builds a prototype architecture for capsule networks. <br />
* Presented promising results that confirm the value of Capsnet as a new direction for development in deep learning.<br />
<br />
= Hinton's Critiques on CNN =<br />
<br />
In a past talk [4], Hinton tried to explain why max-pooling is the biggest problem with current convolutional networks. Here are some highlights from his talk. <br />
<br />
== Four arguments against pooling ==<br />
<br />
* It is a bad fit to the psychology of shape perception: It does not explain why we assign intrinsic coordinate frames to objects and why they have such huge effects. <br />
<br />
* It solves the wrong problem: We want equivariance, not invariance. Disentangling rather than discarding.<br />
<br />
* It fails to use the underlying linear structure: It does not make use of the natural linear manifold that perfectly handles the largest source of variance in images.<br />
<br />
* Pooling is a poor way to do dynamic routing: We need to route each part of the input to the neurons that know how to deal with it. Finding the best routing is equivalent to parsing the image.<br />
<br />
===Intuition Behind Capsules ===<br />
We try to achieve viewpoint invariance in the activities of neurons by doing max-pooling. Invariance here means that by changing the input a little, the output still stays the same while the activity is just the output signal of a neuron. In other words, when in the input image we shift the object that we want to detect by a little bit, networks activities (outputs of neurons) will not change because of max pooling and the network will still detect the object. But the spacial relationships are not taken care of in this approach so instead capsules are used, because they encapsulate all important information about the state of the features they are detecting in a form of a vector. Capsules encode probability of detection of a feature as the length of their output vector. And the state of the detected feature is encoded as the direction in which that vector points to. So when detected feature moves around the image or its state somehow changes, the probability still stays the same (length of vector does not change), but its orientation changes.<br />
<br />
For example given two sets of hospital records the first of which sorts by [age, weight, height] and the second by [height, age, weight] if we apply the machine learning to this data set it would not preform very well. Capsules aims to solve this problem by routing the information (age, weight, height) to the appropriate neurons.<br />
<br />
== Equivariance ==<br />
<br />
To deal with the invariance problem of CNN, Hinton proposes the concept called equivariance, which is the foundation of capsule concept.<br />
<br />
=== Two types of equivariance ===<br />
<br />
==== Place-coded equivariance ====<br />
If a low-level part moves to a very different position it will be represented by a different capsule.<br />
<br />
==== Rate-coded equivariance ====<br />
If a part only moves a small distance it will be represented by the same capsule but the pose outputs of the capsule will change.<br />
<br />
Higher-level capsules have bigger domains so low-level place-coded equivariance gets converted into high-level rate-coded equivariance.<br />
<br />
= Dynamic Routing =<br />
<br />
In the second section of this paper, authors give mathematical representations for two key features in routing algorithm in capsule network, which are squashing and agreement. The general setting for this algorithm is between two arbitrary capsules i and j. Capsule j is assumed to be an arbitrary capsule from the first layer of capsules, and capsule i is an arbitrary capsule from the layer below. The purpose of routing algorithm is to generate a vector output for routing decision between capsule j and capsule i. Furthermore, this vector output will be used in the decision for choice of dynamic routing. <br />
<br />
== Routing Algorithm ==<br />
<br />
The routing algorithm is as the following:<br />
<br />
[[File:DRBC_Figure_1.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
In the following sections, each part of this algorithm will be explained in details.<br />
<br />
=== Log Prior Probability ===<br />
<br />
<math>b_{ij}</math> represents the log prior probabilities that capsule i should be coupled to capsule j, and updated in each routing iteration. As line 2 suggests, the initial values of <math>b_{ij}</math> for all possible pairs of capsules are set to 0. In the very first routing iteration, <math>b_{ij}</math> equals to zero. For each routing iteration, <math>b_{ij}</math> gets updated by the value of agreement, which will be explained later.<br />
<br />
=== Coupling Coefficient === <br />
<br />
<math>c_{ij}</math> represents the coupling coefficient between capsule j and capsule i. It is calculated by applying the softmax function on the log prior probability <math>b_{ij}</math>. The mathematical transformation is shown below (Equation 3 in paper): <br />
<br />
\begin{align}<br />
c_{ij} = \frac{exp(b_ij)}{\sum_{k}exp(b_ik)}<br />
\end{align}<br />
<br />
<math>c_{ij}</math> are served as weights for computing the weighted sum and probabilities. Therefore, as probabilities, they have the following properties:<br />
<br />
\begin{align}<br />
c_{ij} \geq 0, \forall i, j<br />
\end{align}<br />
<br />
and, <br />
<br />
\begin{align}<br />
\sum_{i,j}c_{ij} = 1, \forall i, j<br />
\end{align}<br />
<br />
=== Predicted Output from Layer Below === <br />
<br />
<math>u_{i}</math> are the output vector from capsule i in the lower layer, and <math>\hat{u}_{j|i}</math> are the input vector for capsule j, which are the "prediction vectors" from the capsules in the layer below. <math>\hat{u}_{j|i}</math> is produced by multiplying <math>u_{i}</math> by a weight matrix <math>W_{ij}</math>, such as the following:<br />
<br />
\begin{align}<br />
\hat{u}_{j|i} = W_{ij}u_i<br />
\end{align}<br />
<br />
where <math>W_{ij}</math> encodes some spatial relationship between capsule j and capsule i.<br />
<br />
=== Capsule ===<br />
<br />
By using the definitions from previous sections, the total input vector for an arbitrary capsule j can be defined as:<br />
<br />
\begin{align}<br />
s_j = \sum_{i}c_{ij}\hat{u}_{j|i}<br />
\end{align}<br />
<br />
which is a weighted sum over all prediction vectors by using coupling coefficients.<br />
<br />
=== Squashing ===<br />
<br />
The length of <math>s_j</math> is arbitrary, which is needed to be addressed with. The next step is to convert its length between 0 and 1, since we want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The "squashing" process is shown below:<br />
<br />
\begin{align}<br />
v_j = \frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}<br />
\end{align}<br />
<br />
Notice that "squashing" is not just normalizing the vector into unit length. In addition, it does extra non-linear transformation to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. The reason for doing this is to make decision of routing, which is called "routing by agreement" much easier to make between capsule layers.<br />
<br />
=== Agreement ===<br />
<br />
The final step of a routing iteration is to form an routing agreement <math>a_{ij}</math>, which is represents as a scalar product:<br />
<br />
\begin{align}<br />
a_{ij} = v_{j} \cdot \hat{u}_{j|i}<br />
\end{align}<br />
<br />
As we mentioned in "squashing" section, the length of <math>v_{j}</math> is either close to 0 or close to 1, which will effect the magnitude of <math>a_{ij}</math> in this case. Therefore, the magnitude of <math>a_{ij}</math> indicate the how strong the routing algorithm agrees on taking the route between capsule j and capsule i. For each routing iteration, the log prior probability, <math>b_{ij}</math> will be updated by adding the value of its agreement value, which will effect how the coupling coefficients are computed in the next routing iteration. Because of the "squashing" process, we will eventually end up with a capsule j with its <math>v_{j}</math> close to 1 while all other capsules with its <math>v_{j}</math> close to 0, which indicates that this capsule j should be activated.<br />
<br />
= CapsNet Architecture =<br />
<br />
The second part of this paper discuss the experiment results from a 3-layer CapsNet, the architecture can be divided into two parts, encoder and decoder. <br />
<br />
== Encoder == <br />
<br />
[[File:DRBC_Architecture.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== How many routing iteration to use? === <br />
In appendix A of this paper, the authors have shown the empirical results from 500 epochs of training at different choice of routing iterations. According to their observation, more routing iterations increases the capacity of CapsNet but tends to bring additional risk of overfitting. Moreover, CapsNet with routing iterations less than three are not effective in general. As result, they suggest 3 iterations of routing for all experiments.<br />
<br />
=== Marginal loss for digit existence ===<br />
<br />
The experiments performed include segmenting overlapping digits on MultiMINST data set, so the loss function has be adjusted for presents of multiple digits. The marginal lose <math>L_k</math> for each capsule k is calculate by:<br />
<br />
\begin{align}<br />
L_k = T_k max(0, m^+ - ||v_k||)^2 + \lambda(1 - T_k) max(0, ||v_k|| - m^-)^2<br />
\end{align}<br />
<br />
where <math>m^+ = 0.9</math>, <math>m^- = 0.1</math>, and <math>\lambda = 0.5</math>.<br />
<br />
<math>T_k</math> is an indicator for presence of digit of class k, it takes value of 1 if and only if class k is presented. If class k is not presented, <math>\lambda</math> down-weight the loss which shrinks the lengths of the activity vectors for all the digit capsules. By doing this, The loss function penalizes the initial learning for all absent digit class, since we would like the top-level capsule for digit class k to have long instantiation vector if and only if that digit class is present in the input.<br />
<br />
=== Layer 1: Conv1 === <br />
<br />
The first layer of CapsNet. Similar to CNN, this is just convolutional layer that converts pixel intensities to activities of local feature detectors. <br />
<br />
* Layer Type: Convolutional Layer.<br />
* Input: <math>28 \times 28</math> pixels.<br />
* Kernel size: <math>9 \times 9</math>.<br />
* Number of Kernels: 256.<br />
* Activation function: ReLU.<br />
* Output: <math>20 \times 20 \times 256</math> tensor.<br />
<br />
=== Layer 2: PrimaryCapsules ===<br />
<br />
The second layer is formed by 32 primary 8D capsules. By 8D, it means that each primary capsule contains 8 convolutional units with a <math>9 \times 9</math> kernel and a stride of 2. Each capsule will take a <math>20 \times 20 \times 256</math> tensor from Conv1 and produce an output of a <math>6 \times 6 \times 8</math> tensor.<br />
<br />
* Layer Type: Convolutional Layer<br />
* Input: <math>20 \times 20 \times 256</math> tensor.<br />
* Number of capsules: 32.<br />
* Number of convolutional units in each capsule: 8.<br />
* Size of each convolutional unit: <math>6 \times 6</math>.<br />
* Output: <math>6 \times 6 \times 8</math> 8-dimensional vectors.<br />
<br />
=== Layer 3: DigitsCaps ===<br />
<br />
The last layer has 10 16D capsules, one for each digit. Not like the PrimaryCapsules layer, this layer is fully connected. Since this is the top capsule layer, dynamic routing mechanism will be applied between DigitsCaps and PrimaryCapsules. The process begins by taking a transformation of predicted output from PrimaryCapsules layer. Each output is a 8-dimensional vector, which needed to be mapped to a 16-dimensional space. Therefore, the weight matrix, <math>W_{ij}</math> is a <math>8 \times 16</math> matrix. The next step is to acquire coupling coefficients from routing algorithm and to perform "squashing" to get the output. <br />
<br />
* Layer Type: Fully connected layer.<br />
* Input: <math>6 \times 6 \times 8</math> 8-dimensional vectors.<br />
* Output: <math>16 \times 10 </math> matrix.<br />
<br />
=== The loss function ===<br />
<br />
The output of the loss function would be a ten-dimensional one-hot encoded vector with 9 zeros and 1 one at the correct position.<br />
<br />
<br />
== Regularization Method: Reconstruction ==<br />
<br />
This is regularization method introduced in the implementation of CapsNet. The method is to introduce a reconstruction loss (scaled down by 0.0005) to margin loss during training. The authors argue this would encourage the digit capsules to encode the instantiation parameters the input digits. All the reconstruction during training is by using the true labels of the image input. The results from experiments also confirms that adding the reconstruction regularizer enforces the pose encoding in CapsNet and thus boots the performance of routing procedure. <br />
<br />
=== Decoder ===<br />
<br />
The decoder consists of 3 fully connected layers, each layer maps pixel intensities to pixel intensities. The number of parameters in each layer and the activation functions used are indicated in the figure below:<br />
<br />
[[File:DRBC_Decoder.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== Result ===<br />
<br />
The authors include some results for CapsNet classification test accuracy to justify the result of reconstruction. We can see that for CapsNet with 1 routing iteration and CapsNet with 3 routing iterations, implement reconstruction shows significant improvements in both MINIST and MultiMINST data set. These improvements show the importance of routing and reconstruction regularizer. <br />
<br />
[[File:DRBC_Reconstruction.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
The decision to use a 3 iteration approach came from experimental results. The image below shows the average logit difference over epochs and at the end for different numbers of routing iterations.<br />
<br />
[[File:DRBC_AvgLogitDiff.png|700px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
The above image shows that the average logit difference decreases at a logarithmic rate according to the number of iterations. As part of this, it was seen that the higher routing iterations lead to overfitting on the training dataset. The following image however, shows that when trained on CIFAR10 the training loss is much lower for the 3 iteration method over the 1 iteration method. From these two evaluations the 3 iteration approach was selected as the most ideal.<br />
<br />
[[File:DRBC_TrainLossIter.png|350px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
= Experiment Results for CapsNet = <br />
<br />
In this part, the authors demonstrate experiment results of CapsNet on different data sets, such as MINIST and different variation of MINST, such as expanded MINST, affNIST, MultiMNIST. Moreover, they also briefly discuss the performance on some other popular data set such CIFAR 10. <br />
<br />
== MINST ==<br />
<br />
=== Highlights ===<br />
<br />
* CapsNet archives state-of-the-art performance on MINST with significantly fewer parameters (3-layer baseline CNN model has 35.4M parameters, compared to 8.2M for CapsNet with reconstruction network).<br />
* CapsNet with shallow structure (3 layers) achieves performance that only achieves by deeper network before.<br />
<br />
=== Interpretation of Each Capsule ===<br />
<br />
The authors suggest that they found evidence that dimension of some capsule always captures some variance of the digit, while some others represents the global combinations of different variations, this would open some possibility for interpretation of capsules in the future. After computing the activity vector for the correct digit capsule, the authors fed perturbed versions of those activity vectors to the decoder to examine the effect on reconstruction. Some results from perturbations are shown below, where each row represents the reconstructions when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 from the range [-0.25, 0.25]: <br />
<br />
[[File:DRBC_Dimension.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
== affNIST == <br />
<br />
affNIT data set contains different affine transformation of original MINST data set. By the concept of capsule, CapsNet should gain more robustness from its equivariance nature, and the result confirms this. Compare the baseline CNN, CapsNet achieves 13% improvement on accuracy.<br />
<br />
== MultiMNIST ==<br />
<br />
The MultiMNIST is basically the overlapped version of MINIST. An important point to notice here is that this data set is generated by overlaying a digit on top of another digit from the same set but different class. In other words, the case of stacking digits from the same class is not allowed in MultiMINST. For example, stacking a 5 on a 0 is allowed, but stacking a 5 on another 5 is not. The reason is that CapsNet suffers from the "crowding" effect which will be discussed in the weakness of CapsNet section.<br />
<br />
The architecture used for the training is same as the one used for MNIST dataset. However, decay step of the learning rate is 10x larger to account for the larger dataset. Even with the overlap in MultiMNIST, the network is able to segment both digits separately and it shows that the network is able to position and style of the object in the image.<br />
<br />
[[File:multimnist.PNG | 700px|thumb|center|This figure shows some sample reconstructions on the MultiMNIST dataset using CapsNet. CapsNet reconstructs both of the digits in the image in different colours (green and red). It can be seen that the right most images have incorrect classifications with the 9 being classified as a 0 and the 7 being classified as an 8. ]]<br />
<br />
== Other data sets ==<br />
<br />
CapsNet is used on other data sets such as CIFAR10, smallNORB and SVHN. The results are not comparable with state-of-the-art performance, but it is still promising since this architecture is the very first, while other networks have been development for a long time. The authors pointed out one drawback of CapsNet is that they tend to account for everything in the input images - in the CIFAR10 dataset, the image backgrounds were too varied to model in a reasonably sized network, which partly explains the poorer results.<br />
<br />
= Conclusion = <br />
<br />
This paper discuss the specific part of capsule network, which is the routing-by-agreement mechanism. <br />
<br />
The authors suggest this is a great approach to solve the current problem with max-pooling in convolutional neural network. We see that the design of the capsule builds up upon the design of artificial neuron, but expands it to the vector form to allow for more powerful representational capabilities. It also introduces matrix weights to encode important hierarchical relationships between features of different layers. The result succeeds to achieve the goal of the designer: neuronal activity equivariance with respect to changes in inputs and invariance in probabilities of feature detection. <br />
<br />
Moreover, as author mentioned, the approach mentioned in this paper is only one possible implementation of the capsule concept. Approaches like [https://openreview.net/pdf?id=HJWLfGWRb/ this] have also been proposed to test other routing techniques.<br />
<br />
The preliminary results from experiment using a simple shallow CapsNet also demonstrate unparalleled performance that indicates the capsules are a direction worth exploring.<br />
<br />
= Weakness of Capsule Network =<br />
<br />
* Routing algorithm introduces internal loops for each capsule. As number of capsules and layers increases, these internal loops may exponentially expand the training time. <br />
* Capsule network suffers a perceptual phenomenon called "crowding", which is common for human vision as well. To address this weakness, capsules have to make a very strong representation assumption that at each location of the image, there is at most one instance of the type of entity that capsule represents. This is also the reason for not allowing overlaying digits from same class in generating process of MultiMINST.<br />
* Other criticisms include that the design of capsule networks requires domain knowledge or feature engineering, contrary to the abstraction-oriented goals of deep learning.<br />
* Capsule networks have not been able to produce results on data sets such as CIFAR10, smallNORB and SVHN that are comparable with the state of the art. This is likely due to the fact that Capsule nets have a hard time dealing with background image information.<br />
<br />
= Implementations = <br />
1) Tensorflow Implementation : https://github.com/naturomics/CapsNet-Tensorflow<br />
<br />
2) Keras Implementation. : https://github.com/XifengGuo/CapsNet-Keras<br />
<br />
= References =<br />
# S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017<br />
# “XifengGuo/CapsNet-Keras.” GitHub, 14 Dec. 2017, github.com/XifengGuo/CapsNet-Keras. <br />
# “Naturomics/CapsNet-Tensorflow.” GitHub, 6 Mar. 2018, github.com/naturomics/CapsNet-Tensorflow.<br />
# Geoffrey Hinton. "What is wrong with convolutional neural nets?", https://youtu.be/rTawFwUvnLE?t=612</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=36291MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-04-11T17:20:10Z<p>Tdunn: /* Commentary */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforce re-projection consistency between the 3D shape and the estimated sketch. 2.5D is the construction of a 3D environment using 2D retina projection along with depth perception obtained from the image. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth maps (contains information related to the distance of surfaces from a viewpoint) and surface normal maps (technique for adding the illusion of depth details to surfaces using an image's RGB information), the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
=== Notes: 2.5D === <br />
<br />
Two and a half dimensional (shortened to 2.5D, known alternatively as three-quarter perspective and pseudo-3D) is a term used to describe either 2D graphical projections and similar techniques used to cause images to simulate the appearance of being three-dimensional (3D) when in fact they are not, or gameplay in an otherwise three-dimensional video game that is restricted to a two-dimensional plane or has a virtual camera with fixed angle.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. A voxel is an abbreviation for volume element, the three-dimensional version of a pixel. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, and has been widely used in 3D shape completion with the use of depths and silhouettes. A few recent papers [5,6,7,8] discussed enforcing differentiable 2D-3D constraints between shape and silhouettes to enable joint training of deep networks for the task of 3D reconstruction. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in the figure below. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the 2.5 sketch with surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information such as texture and lighting. An encoder-decoder architecture is used. The encoder is a A ResNet-18 network, which takes a 256 x 256 RGB image and produces 512 feature maps of size 8 x 8. The decoder is four sets of 5 x 5 fully convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem since it only takes in surface normal and depth images as input. The network architecture is inspired by the TL[10] network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 fully convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel with depth <math>v_{x, y}, d_{x, y}</math> should be 1, while all voxels in front of it should be 0. This ensures the estimated 3D shape matches the estimated depth values. The projected depth loss and its gradient are defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
When <math>d_{x, y} = \infty</math>, all voxels in front of it should be 0 when there is no intersection between the line and its shape, referred as the silhouette criterion.<br />
<br />
=== Surface Normals ===<br />
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal tried to guarantee voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y-1, z+\frac{n_b}{n_c}}} = 2(v_{x, y-1, z+\frac{n_b}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y+1, z-\frac{n_b}{n_c}}} = 2(v_{x, y+1, z-\frac{n_b}{n_c}}-1)<br />
</math><br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets; ShapeNet, PASCAL 3D+, and IKEA. Intersection-over-Union (IoU) is the main measurement of comparison between the models. However the authors note that models which focus on the IoU metric fail to capture the details of the object they are trying to model, disregarding details to focus on the overall shape. To counter this drawback they poll people on which reconstruction is preferred. IoU is also computationally inefficient since it has to check over all possible scales.<br />
<br />
== ShapeNet ==<br />
The data is based on synthesized images of ShapeNet chairs [Chang et al., 2015]. From the SUN database [Xiao et al., 2010], they combine the chars with random backgrounds and use a physics-based renderer by Jakob to render the corresponding RGB, depth, surface normal, and silhouette images.<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random background from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. The estimated normal and depth images are able to extract intrinsic information about object shape while leaving behind non-essential information such as textures from the original images. Quantitatively, the full model also achieves 0.57 integer over union score (which compares the overlap of the predicted model and ground truth), which is higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. Since PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.<br />
<br />
[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans prefered the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studes show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder. Therefore, future work should address more "difficult" shapes and forms; it should be more difficult to generalize shapes that are more complex than furniture.<br />
<br />
Also there is ambiguity in terms of how the aforementioned self-supervision can work as the authors claim that the model can be fine-tuned using a single image itself. If the parameters are constrained to a single image, then it means it will not generalize well. It is not clearly explained as to what can be fine-tuned.<br />
<br />
The paper does not propose or implement a baseline model to which MarrNet should be compared.<br />
<br />
The model uses information from a single image. 3D shape estimation in biological agents incorporates information from multiple images or even video. A logical next step for improving this model would be to include images of the object from multiple angles.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper. The repository provides the source code as written by the authors: https://github.com/jiajunwu/marrnet<br />
<br />
= References =<br />
# Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T. Freeman, Joshua B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches, 2017<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# JiajunWu, Chengkai Zhang, Tianfan Xue,William T Freeman, and Joshua B Tenenbaum. Learning a Proba- bilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.<br />
# Wu, J. (n.d.). Jiajunwu/marrnet. Retrieved March 25, 2018, from https://github.com/jiajunwu/marrnet<br />
# Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016a.<br />
# Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.<br />
# Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Rohit Girdhar, David F. Fouhey, Mikel Rodriguez and Abhinav Gupta, Learning a Predictable and Generative Vector Representation for Objects, in ECCV 2016<br />
#Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015. <br />
#Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. <br />
#Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36290Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-08T17:33:43Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6. Here synchronous training refers to the process of training both the encoder and decoder at the same time.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003)<sup>[[#References|[8]]]</sup> and the dataset from Romani Picas et al.<sup>[[#References|[9]]]</sup> However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in Figure 2), and the attack (B in Figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in Figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in Figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in Figure 2). The authors do add that the WaveNet produces some strikes (L in Figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in Figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.<br />
# Goto, Masataka, Hashiguchi, Hiroki, Nishimura, Takuichi, and Oka, Ryuichi. Rwc music database: Music genre database and musical instrument sound database. 2003.<br />
# Romani Picas, Oriol, Parra Rodriguez, Hector, Dabiri, Dara, Tokuda, Hiroshi, Hariya, Wataru, Oishi, Koji, and Serra, Xavier. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society, 2015</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings&diff=36286A Neural Representation of Sketch Drawings2018-04-07T17:00:09Z<p>Tdunn: /* Source */</p>
<hr />
<div>= Introduction =<br />
<br />
There have been many recent advances in neural generative models for low resolution pixel-based images. Humans, however, do not see the world in a grid of pixels and more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method for creating conditional and unconditional generative models for creating these kinds of vector sketch drawings based on recurrent neural networks (RNNs). For the conditional generation mode, the authors explore the model's latent space that it uses to express the vector image. The paper also explores many applications of these kinds of models, especially creative applications and makes available their unique dataset of vector images.<br />
<br />
= Related Work =<br />
<br />
Previous work related to sketch drawing generation includes methods that focused primarily on converting input photographs into equivalent vector line drawings. Image generating models using neural networks also exist but focused more on generation of pixel-based imagery. For example, Gatys et al.'s (2015) work focuses on separating style and content from pixel-based artwork and imagery. Some recent work has focused on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has been extended somewhat recently to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modeling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Auto-encoders to model English sentences in latent vector space.<br />
<br />
One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.<br />
<br />
= Methodology =<br />
<br />
=== Dataset ===<br />
<br />
The “QuickDraw” dataset used in this research was assembled from 75K user drawings extracted from the game “Quick, Draw!” where users drew objects from one of hundreds of classes in 20 seconds or less. The dataset is split into 70K training samples and 2.5K validation and test samples each and represents each sketch a set of “pen stroke actions”. Each action is provided as a vector in the form <math>(\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>. For each vector, <math>\Delta x</math> and <math>\Delta y</math> give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of pen states; <math>p_{1}</math> indicates that the pen is down and a line should be drawn between the current point and the next point, <math>p_{2}</math> indicates that the pen is up and no line should be drawn between the current point and the next point, and <math>p_{3}</math> indicates that the drawing is finished and subsequent points and the current point should not be drawn.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchrnn.PNG]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder model is a symmetric and parallel set of two RNNs that individually process the sketch drawings (sequence <math>S</math>) in forward and reverse order, respectively. The hidden state produced by each encoder model is then concatenated into a single hidden state <math>h</math>. <br />
<br />
\begin{align}<br />
h_\rightarrow = \text{encode}_\rightarrow(S), h_\leftarrow = \text{encode}_\leftarrow(S_{\text{reverse}}), h=[h_\rightarrow; h_\leftarrow]<br />
\end{align}<br />
<br />
The concatenated hidden state <math>h</math> is then projected into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> each of size <math>N_{z}</math> using a fully connected layer. <math>\hat{\sigma}</math> is then converted into a non-negative standard deviation parameter <math>\sigma</math> using an exponential operator. These two parameters <math>\mu</math> and <math>\sigma</math> are then used along with an IID Gaussian vector distributed as <math>\mathcal{N}(0, I)</math> of size <math>N_{z}</math> to construct a random vector <math>z \in ℝ^{N_{z}}</math>, similar to the method used for VAE:<br />
\begin{align}<br />
\mu = W_{\mu}h + b_{mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = exp\bigg{(}\frac{\hat{\sigma}}{2}\bigg{)}\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I)<br />
\end{align}<br />
<br />
The decoder model is an autoregressive RNN that samples output sketches from the latent vector <math>z</math>. The initial hidden states of each recurrent neuron are determined using <math>[h_{0}, c_{0}] = tanh(W_{z}z + b_{z})</math>. Each step of the decoder RNN accepts the previous point <math>S_{i-1}</math> and the latent vector <math>z</math> as concatenated input. The initial point given is the origin point with pen state down. The output at each step are the parameters for a probability distribution of the next point <math>S_{i}</math>. Outputs <math>\Delta x</math> and <math>\Delta y</math> are modeled using a Gaussian Mixture Model (GMM) with M normal distributions and output pen states <math>(q_{1}, q_{2}, q_{3})</math> modelled as a categorical distribution with one-hot encoding.<br />
\begin{align}<br />
P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1<br />
\end{align}<br />
<br />
For each of the M distributions in the GMM, parameters <math>\mu</math> and <math>\sigma</math> are output for both the x and y locations signifying the mean location of the next point and the standard deviation, respectively. Also output from each model is parameter <math>\rho_{xy}</math> signifying correlation of each bivariate normal distribution. An additional vector <math>\Pi</math> is an output giving the mixture weights for the GMM. The output <math>S_{i}</math> is determined from each of the mixture models using softmax sampling from these distributions.<br />
<br />
One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete will only appear one time per each sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter <math>N_{max}</math> which basically is the length of the longest sketch in the dataset and limits the number of points per drawing to being no more than <math>N_{max}</math>, after which all output states form the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.<br />
<br />
To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step and the model is sampled until a “stop drawing” state appears or the time state reaches time <math>N_{max}</math>. The authors also introduce a “temperature” parameter <math>\tau</math> that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:<br />
<br />
\begin{align}<br />
\hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau<br />
\end{align}<br />
<br />
This parameter <math>\tau</math> lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.<br />
<br />
=== Unconditional Generation ===<br />
<br />
[[File:paper15_Unconditional_Generation.png|800px|]]<br />
<br />
The authors also explored unconditional generation of sketch drawings by only training the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing input are used as input without any conditional latent variable <math>z</math>. Figure 3 above shows different sketches that are sampled from the network by only varying the temperature parameter <math>\tau</math> between 0.2 and 0.9.<br />
<br />
=== Training ===<br />
The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss <math>L_{R}</math> and KL Divergence Loss <math>L_{KL}</math>. The reconstruction loss term is composed of two terms; <math>L_{s}</math>, which tries to maximize the log-likelihood of the generated probability distribution explaining the training data <math>S</math> and <math>L_{p}</math> which is the log loss of the pen state terms.<br />
\begin{align}<br />
L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}log\bigg{(}\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\bigg{)}<br />
\end{align}<br />
\begin{align}<br />
L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}log(q_{k,i})<br />
\end{align}<br />
\begin{align}<br />
L_{R} = L_{s} + L{p}<br />
\end{align}<br />
<br />
The KL divergence loss <math>L_{KL}</math> measures the difference between the latent vector <math>z</math> and an IID Gaussian distribution with 0 mean and unit variance. This term, normalized by the number of dimensions <math>N_{z}</math> is calculated as:<br />
\begin{align}<br />
L_{KL} = -\frac{1}{2N_{z}}\big{(}1 + \hat{\sigma} - \mu^{2} – exp(\hat{\sigma})\big{)}<br />
\end{align}<br />
<br />
The loss for the entire model is thus the weighted sum:<br />
\begin{align}<br />
Loss = L_{R} + w_{KL}L_{KL}<br />
\end{align}<br />
<br />
The value of the weight parameter <math>w_{KL}</math> has the effect that as <math>w_{KL} \rightarrow 0</math>, there is a loss in ability to enforce a prior over the latent space and the model assumes the form of a pure autoencoder. As with VAEs, there is a trade-off between optimizing for the two loss terms (i.e. between how precisely the model can regenerate training data <math>S</math> and how closely the latent vector <math>z</math> follows a standard normal distribution) - smaller values of <math>w_{KL}</math> lead to better <math>L_R</math> and worse <math>L_{KL}</math> compared to bigger values of <math>w_{KL}</math>. Also for unconditional generation, the model is a standalone decoder, so there will be no <math>L_{KL}</math> term as only <math>L_{R}</math> is optimized for. This trade-off is illustrated in Figure 4 showing different settings of <math>w_{KL}</math> and the resulting <math>L_{KL}</math> and <math>L_{R}</math>, as well as just <math>L_{R}</math> in the case of unconditional generation with only a standalone decoder.<br />
<br />
[[File:paper15_fig4.png|600px]]<br />
<br />
In practice however it was found that annealing the KL term improves the training of the network. While the original loss can be used for testing and validation, when training the following variation on the loss is used:<br />
\begin{align}<br />
\eta_{step} = 1 - (1 - \eta_{min})R^{step}<br />
\end{align}<br />
\begin{align}<br />
Loss_{train} = L_{R} + w_{KL} \eta_{step} max(L_{KL},KL_{min})<br />
\end{align}<br />
<br />
As can be seen above, the <math>\eta_{step}</math> term will start at some preset <math>\eta_{min}</math> value. As <math>R</math> is a value slightly smaller than 1, the <math>R^{step}</math> term will converge to 0, and thus <math>\eta_{step}</math> will converge to 1. This will have the affect of focusing the training loss in the early stages on the reconstruction loss <math>L_{R}</math>, but to a more balanced loss in the later steps. Additionally, in practice it was found that when <math>L_{KL}</math> got too low, the network would cease learning. To combat this a floor value for the KL loss was implemented by the <math>max(.)</math> function.<br />
<br />
=== Model Configuration ===<br />
In the given model, the encoder and decoder RNNs consist of 512 and 2048 nodes respectively. Also, M = 20 mixture components are used for the decoder RNN. Layer Normalization is applied to the model, and during training recurrent dropout is applied with a keep probability of 90%. The model is trained with batch sizes of 100 samples, using Adam with a learning rate of 0.0001 and gradient clipping of 1.0. During training, simple data augmentation is performed by multiplying the offset columns by two IID random factors. <br />
<br />
= Experiments =<br />
The authors trained multiple conditional and unconditional models using varying values of <math>w_{KL}</math> and recorded the different <math>L_{R}</math> and <math>L_{KL}</math> values at convergence. The network used LSTM as it’s encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks. (A HyperLSTM consists of two coupled LSTMS: an auxiliary LSTM and a main LSTM. At every time step, the auxiliary LSTM reads the previous hidden state and the current input vector, and computes an intermediate vector <math display="inline"> z </math>. The weights of the main LSTM used in the current time step are then a learned function of this intermediate vector <math display="inline"> z </math>. That is, the weights of the main LSTM are allowed to vary between time steps as a function of the output of the auxiliary LSTM. See Ha et al. (2016) for details)<br />
<br />
=== Conditional Reconstruction ===<br />
[[File:conditional_generation.PNG]]<br />
<br />
The authors qualitatively assessed the reconstructed images <math>S’</math> given input sketch <math>S</math> using different values for the temperature hyperparameter <math>\tau</math>. The figure above shows the results for different values of <math>\tau</math> starting with 0.01 at the far left and increasing to 1.0 on the far right. Interestingly, sketches with extra features like a cat with 3 eyes are reproduced as a sketch of a cat with two eyes and sketches of object of a different class such as a toothbrush are reproduced as a sketch of a cat that maintains several of the input toothbrush sketches features.<br />
<br />
=== Latent Space Interpolation ===<br />
[[File:latent_space_interp.PNG]]<br />
<br />
The latent space vectors <math>z</math> have few “gaps” between encoded latent space vectors due to the enforcement of a Gaussian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images in the same style as latent space arithmetic on Word2Vec vectors. A model trained with higher <math>w_{KL}</math> is expected to produce images closer to the data manifold, and the figure above shows reconstructed images from latent vector interpolation between the original images. Results from the model trained with higher <math>w_{KL}</math> seem to produce more coherent images.<br />
<br />
=== Sketch Drawing Analogies ===<br />
Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head, and then adding a latent vector to the embedding layer that represents “body”. As an example, this “body” vector might be produced by taking a drawing of a pig with a body and subtracting a vector representing the pigs head.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches ===<br />
[[File:predicting_endings.PNG]]<br />
<br />
Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode some existing points into the hidden state of the decoder network and then generating the remaining points of the sketch with <math>\tau</math> set to 0.8.<br />
<br />
= Applications and Future Work =<br />
Sketch-RNN may enable the production of several creative applications. These might include suggesting ways an artist could finish a sketch, enabling artists to explore latent space arithmetic to find interesting outputs given different sketch inputs, or allowing the production of multiple different sketches of some object as a purely generative application. The authors suggest that providing some conditional sketch of an object to a model designed to produce output from a different class might be useful for producing sketches that morph the two different object classes into one sketch. For example, the image below was trained on drawing cats, but a chair was used as the input. This results in a chair looking cat.<br />
<br />
[[File:cat-chair.png]]<br />
<br />
Sketch-RNN may also be useful as a teaching tool to help people learn how to draw, especially if it were to be trained on higher quality images. Teaching tools might suggest to students how to proceed to finish a sketch or intake low fidelity sketches to produce a higher quality and “more coherent” output sketch.<br />
<br />
The authors noted that Sketch-RNN is not as effective at generating coherent sketches when trained on a large number of classes simultaneously (experiments mostly used datasets consisting of one or two object classes), and plan to use class information outside the latent space to try to model a greater number of classes.<br />
<br />
Finally, the authors suggest that combining this model with another that produces photorealistic pixel-based images using sketch input, such as Pix2Pix may be an interesting direction for future research. In this case, the output from the Sketch-RNN model would be used as input for Pix2Pix and could produce photorealistic images given some crude sketch from a user.<br />
<br />
= Limitations =<br />
The authors note a major limitation to the model is the training time relative to the number of data points. When sketches surpass 300 data points the model is difficult to train. To counteract this effect the Ramer-Douglas-Peucker algorithm was used to reduce the number of data points per sketch. This algorithm attempts to significantly reduce the number of data points while keeping the sketch as close to the original as possible.<br />
<br />
Another limitation is the effectiveness of generating sketches as the complexity of the class increases. Below are sketches of a few classes which show how the less complex classes such as cats and crabs are more accurately generated. Frogs (more complex) tend to have overly smooth lines drawn which do not seem to be part of realistic frog samples.<br />
<br />
[[File:paper15_classcomplexity.png]]<br />
<br />
A further limitation is the need to train individual neural networks for each class of drawings. While the former is useful in sketch completion with labeled incomplete sketches it may produce low quality results when the starting sketch is very different than any part of the learned representation. Further work can be done to extend the model to account for both prediction of the label and sketch completion.<br />
<br />
= Conclusion =<br />
The authors presented Sketch-RNN, a RNN model for modelling and generating vector-based sketch drawings with a goal to abstract concepts in the images similar to the way humans think. The VAE inspired architecture allows sampling the latent space to generate new drawings and also allows for applications that use latent space arithmetic in the style of Word2Vec to produce new drawings given operations on embedded sketch vectors. The authors also made available a large dataset of sketch drawings in the hope of encouraging more research in the area of vector-based image modelling.<br />
<br />
= Criticisms =<br />
The paper produces an interesting model that can effectively model vector-based images instead of traditional pixel-based images. This is an interesting problem because vector based images require producing a new way to encode the data. While the results from this paper are interesting, most of the techniques used are borrowed ideas from Variational Autoencoders and the main architecture is not terribly groundbreaking. <br />
<br />
One novel part about the architecture presented was the way the authors used GMMs in the decoder network. While this was interesting and seemed to allow the authors to produce different outputs given the same latent vector input <math>z</math> by manipulating the <math>\tau</math> hyperparameter, it was not that clear in the article why GMMs were used instead of a more simple architecture. Much time was spent explaining basics about GMM parameters like <math>\mu</math> and <math>\sigma</math>, but there was comparatively little explanation about how points were actually sampled from these mixture models.<br />
<br />
The authors contribute to a novel dataset but fail to evaluate the quality of the dataset, including generalized metrics for evaluation. They also provide no comparisons of their method on this dataset with other baseline sequence generation approaches.<br />
<br />
Finally, the authors gloss somewhat over how they were able to encode previous sketch points using only the decoder network into the hidden state of the decoder RNN to finish partially finished sketches. I can only assume that some kind of back-propagation was used to encode the expected sketch points into the hidden parameters of the decoder, but no explanation was given in the paper.<br />
<br />
== Major Contributions ==<br />
<br />
The paper provides with intuition of their approach and detailed below are the major contributions of this paper:<br />
* For images composed by sequence of lines, such as hand drawing, this paper proposed a framework to generate such image in vector format, conditionally and unconditionally. <br />
* Provided a unique training procedure that targets vector images, which makes training procedures more robust.<br />
* Composed large dataset of hand drawn vector images which benefits future development.<br />
* Discussed several potential applications of this methodology, such as drawing assist for artists and educational tool for students.<br />
<br />
= Implementation =<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn<br />
<br />
= Source =<br />
<br />
# Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).<br />
# Tensorflow/magenta. (n.d.). Retrieved March 25, 2018, from https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn<br />
# Graves et al, 2013, https://arxiv.org/pdf/1308.0850.pdf<br />
# David Ha, Andrew Dai, Quoc V. Le. HyperNetworks. (2016) arXiv:1609.09106</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations&diff=36285Understanding Image Motion with Group Representations2018-04-07T16:39:01Z<p>Tdunn: /* Learning Motion by Group Properties */</p>
<hr />
<div>== Introduction ==<br />
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformations caused by motion, therefore unrealistic motion caused by say, face swapping, is not considered. <br />
<br />
To be useful for understanding and acting on scene motion, a representation should capture the motion of the observer and all relevant scene content. Supervised training of such a representation is challenging: explicit motion labels are difficult to obtain, especially for nonrigid scenes where it can be unclear how the structure and motion of the scene should be decomposed. The proposed learning method does not need labeled data. Instead, the method applies constraints to learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captures motion in both 2D and 3D settings. This method can be used to extract useful information for vehicle localization, tracking, and odometry.<br />
<br />
[[File:paper13_fig1.png|650px|center|]]<br />
<br />
== Related Work ==<br />
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represent poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. The most used method for local representation is optical flow, which estimates motion by pixel over 2-D image. Furthermore, scene flow is a more generalized method of optical flow which estimates the point trajectories from 3-D motions. The limitation of optical flow is that it only captures motion locally, which makes capturing the overall motion impossible. <br />
<br />
Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions since there is usually a dimensionality reduction process involved. However these approaches are restricted to fixed windows of representation. <br />
<br />
There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.<br />
<br />
== Approach ==<br />
The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene. The method is designed to capture group associativity and invertibility.<br />
<br />
Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this. To be more specific, the latent structure of a scene from rigid image motion could be modelled by a point cloud with a motion space <math>M=SE(3)</math>, where rigid image motion can be produced by a camera translating and rotating through a rigid scene in 3D. When a scene has N rigid bodies, the motion space can be represented as <math>M=[SE(3)]^N</math>.<br />
<br />
=== Learning Motion by Group Properties ===<br />
The goal is to learn a function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating mapping of image pairs from <math> M </math> to its representation, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that emulates the composition of these elements in <math> M </math>. For all sequences, it is assumed that for all times <math> t_0 < t_1 < t_2 ... </math>, the sequence representation should have the following properties: <br />
# Associativity: <math> \Phi(I_{t_0}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3})</math>, which means that the motion of differently composed subsequences of a sequence are equivalent<br />
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>, where <math>e</math> is the null image motion and the unique identity in the latent space<br />
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>, so the inverse of the motion of an image sequence is the motion of that image sequence reversed<br />
<br />
Also note that a notion of transitivity is assumed, specifically <math>\Phi(I_{t_0}, I_{t_2}) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})</math>.<br />
<br />
An embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing sequences with the same final motion but different transitions to the same representation. Invertibility is encouraged by pushing sequences corresponding to the same motion with but in opposite directions away from each other, as well as pushing all loops to the same representation. Uniqueness of the identity is encouraged by pushing loops away from non-identity representations. Loops from different sequences are also pushed to the same representation (the identity).<br />
<br />
These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured. On the other hand, optical flow assumes unchanging brightness between frames of the same projected scene, and motion estimates would degrade when that assumption does not hold.<br />
<br />
Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing content, scene cuts, or long-term occlusions.<br />
<br />
=== Sequence Learning with Neural Networks ===<br />
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images and generates intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce an embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final time step is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences as defined below:<br />
<br />
<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center><br />
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center><br />
<br />
where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed scalar margin selected to be 0.5. Positive pairs are training examples where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.<br />
<br />
Each training sequence is recomposed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scales. Training the network with only one input images per time step was also tried, but consistently yielded work results than image pairs.<br />
<br />
[[File:paper13_fig2.png|650px|center|]]<br />
<br />
Overall, training with image pairs resulted in lower error than training with just single images. This is demonstrated in the below table.<br />
<br />
<br />
[[File:table.png|700px|center|]]<br />
<br />
== Experimentation ==<br />
Trained network using rotated and translated MNIST dataset as well as KITTI dataset. <br />
* Used Torch<br />
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach<br />
* 50-60 batch size for MNIST, 25-30 batch size for KITTI<br />
* Dilated convolution with Relu and batch normalization<br />
* Two LSTM cell per layer 256 hidden units each<br />
* Sequence length of 3-5 images<br />
* MINIST networks with up to 12 images <br />
<br />
=== Rigid Motion in 2D ===<br />
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations<br />
* Visualized the representation using t-SNE<br />
** Clear clustering by translation and rotation but not object classes<br />
** Suggests the representation captures the motion properties in the dataset, but is independent of image contents<br />
* Visualized the image-conditioned saliency maps<br />
**In Figure 3, the red represents the positive gradients of the activation function with respect to the input image, and the negative gradients are represented in blue.<br />
**If we consider a saliency map as a first-order Taylor expansion, then the map could show the relationship between pixel and the representation.<br />
** Take derivative of the network output respect to the map<br />
** The area that has the highest gradient means that part contributes the most to the output<br />
** The resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing<br />
** Suggests the network is learning the right motion structure<br />
<br />
[[File:paper13_fig3.png|700px|center|]]<br />
<br />
=== Real World Motion in 3D ===<br />
* Uses KITTI dataset collected on a car driving through roads in Germany<br />
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth<br />
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion<br />
** The data shows it performs not as well as the supervised algorithm, but consistently better than chance (guessing the mean value)<br />
** Largest improvements are shown in X and Z translation, which also have the most variance in the data<br />
** Shows the method is able to capture dominant motion structure<br />
* Test performance on interpolation task<br />
** Check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math><br />
** Test how sensitive the network is to deviations from unnatural motion<br />
** High errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion<br />
** In order to do this, the distance between the embeddings of the frame sequences of the first and last frame <math>R([I_1,I_T])</math> and of the first, middle, and last frame <math>R([I_1, I_m, I_T])</math> is computed. This distance is compared with the distance when the middle frame of the second embedding is changed to a frame that is visually similar (inside sequence): <math>R([I_1, I_{IN}, I_T])</math> and one that is visually dissimilar (outside sequence): <math>R([I_1, I_{OUT}, I_T])</math>. The results are shown in Table 3. The embedding distance method is compared to the Euclidean distance which is defined as the mean pixel distance between the test frame and <math>{I_1,I_T}</math>, whichever is smaller. It can be seen from the results that the embedding distance of the true frame is significantly lower than other frames. This means that the embedding distance used in the network is more sensitive to any atypical motions of the scenes. <br />
* Visualized saliency maps<br />
** Highlights objects moving in the background, and motion of the car in the foreground<br />
** Suggests the method can be used for tracking as well<br />
<br />
[[File:paper13_tab2.png|700px|center|]]<br />
<br />
[[File:paper13_fig4.png|700px|center|]]<br />
<br />
[[File:paper13_fig5.png|700px|center|]]<br />
<br />
[[File:table3_motion.PNG|700px|center|]]<br />
<br />
* Figure 7 displays graphs comparing the mean squared error of the method presented in this paper to the baseline chance method and the supervised Flow PCA method.<br />
<br />
[[File:paper13_fig6.PNG|700px|center|]]<br />
<br />
== Conclusion ==<br />
The authors presented a new model of motion and a method for learning motion representations. It is shown that by enforcing group properties we can learn motion representations that are able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to a large variety of tasks.<br />
<br />
== Criticism ==<br />
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results were compared only indirectly against ground truth, also shows poor results when compared with the Flow+PCA baseline, especially for X and Z translations as well as Y rotation, which are the main elements of motion present in the KITTI dataset. Such experimentation made the method hardly convincing and applicable to real-world applications. In addition, the network does not output motion representations with physical meanings, making the proposed method useless for any real world applications.<br />
<br />
One of the motivations the authors use for this approach is that traditional SLAM formulations represent motion as a sequence of poses in <math> SE(3) </math>, and that they are unable to represent non-rigid or independent motions. There exist SLAM formulations that represent motion as [http://ieeexplore.ieee.org/document/7353368/ Gaussian processes], as well as [http://journals.sagepub.com/doi/abs/10.1177/0278364915585860 temporal basis functions], and it is quite [https://openslam.org/robotvision.html common] for inertial, monocular-camera SLAM problems to use a motion representation on <math> SIM(3) </math>, which is the group containing all scale-preserving transformations. A <math> SIM(3) </math> transformation is not, in general, rigid, so it is not true to say that modern SLAM is unable to represent non-rigid motions. Additionally, the saliency images from the KITTI experiment displaying network gradients on independently moving objects in the scene does not necessarily mean that the motion representation is capturing independent motion, it just means that the network representation is dependent on those pixels. As the authors did not provide an error comparison between images containing independent motions and those without, it is possible that these network gradients only contribute to error (in terms of the camera pose) instead of capturing independent motions.<br />
<br />
Another criticism is that the group-properties constraint the authors impose is too weak. Any set consisting of functions, their inverses, and the identity forms a group. While physical motions are one example of such a group, there are many valid groups that do not represent any coherent physical motions. That is, it's unclear whether group representations adequately describe the underlying mechanisms of the paper.<br />
<br />
Since the network has to learn both the group elements, <math>\overline{M}</math>, and the composition function, <math>\diamond</math>, associated with the group it is difficult to tell how each of them are performing. It would not be possible to perform a layer-by-layer ablation study to determine the individual contributions of the functions associated with each group.<br />
<br />
Finally, the method requires domain knowledge of the motion space and feature engineering for encoding it, which reduces the ease with which the method can be generalized to various tasks.<br />
<br />
== References ==<br />
Jaegle, A. (2018). Understanding image motion with group representations . ICLR. Retrieved from https://openreview.net/pdf?id=SJLlmG-AZ.<br />
<br />
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ''International Conference on Learning Representations (ICLR) Workshop'', 2013.<br />
<br />
Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. ''European Conference on Computer Vision (ECCV) Workshops'', 2016.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations&diff=36284Understanding Image Motion with Group Representations2018-04-07T16:33:04Z<p>Tdunn: /* Learning Motion by Group Properties */</p>
<hr />
<div>== Introduction ==<br />
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformations caused by motion, therefore unrealistic motion caused by say, face swapping, is not considered. <br />
<br />
To be useful for understanding and acting on scene motion, a representation should capture the motion of the observer and all relevant scene content. Supervised training of such a representation is challenging: explicit motion labels are difficult to obtain, especially for nonrigid scenes where it can be unclear how the structure and motion of the scene should be decomposed. The proposed learning method does not need labeled data. Instead, the method applies constraints to learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captures motion in both 2D and 3D settings. This method can be used to extract useful information for vehicle localization, tracking, and odometry.<br />
<br />
[[File:paper13_fig1.png|650px|center|]]<br />
<br />
== Related Work ==<br />
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represent poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. The most used method for local representation is optical flow, which estimates motion by pixel over 2-D image. Furthermore, scene flow is a more generalized method of optical flow which estimates the point trajectories from 3-D motions. The limitation of optical flow is that it only captures motion locally, which makes capturing the overall motion impossible. <br />
<br />
Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions since there is usually a dimensionality reduction process involved. However these approaches are restricted to fixed windows of representation. <br />
<br />
There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.<br />
<br />
== Approach ==<br />
The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene. The method is designed to capture group associativity and invertibility.<br />
<br />
Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this. To be more specific, the latent structure of a scene from rigid image motion could be modelled by a point cloud with a motion space <math>M=SE(3)</math>, where rigid image motion can be produced by a camera translating and rotating through a rigid scene in 3D. When a scene has N rigid bodies, the motion space can be represented as <math>M=[SE(3)]^N</math>.<br />
<br />
=== Learning Motion by Group Properties ===<br />
The goal is to learn a function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating mapping of image pairs from <math> M </math> to its representation, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that emulates the composition of these elements in <math> M </math>. For all sequences, it is assumed that for all times <math> t_0 < t_1 < t_2 ... </math>, the sequence representation should have the following properties: <br />
# Associativity: <math> \Phi(I_{t_0}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3})</math>, which means that the motion of differently composed subsequences of a sequence are equivalent<br />
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>, where <math>e</math> is the null image motion and the unique identity in the latent space<br />
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>, so the inverse of the motion of an image sequence is the motion of that image sequence reversed<br />
An embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing sequences with the same final motion but different transitions to the same representation. Invertibility is encouraged by pushing sequences corresponding to the same motion with but in opposite directions away from each other, as well as pushing all loops to the same representation. Uniqueness of the identity is encouraged by pushing loops away from non-identity representations. Loops from different sequences are also pushed to the same representation (the identity).<br />
<br />
These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured. On the other hand, optical flow assumes unchanging brightness between frames of the same projected scene, and motion estimates would degrade when that assumption does not hold.<br />
<br />
Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing content, scene cuts, or long-term occlusions.<br />
<br />
=== Sequence Learning with Neural Networks ===<br />
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by a CNN and an RNN (an LSTM), respectively. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images and generates intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce an embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final time step is used in the loss. The network is trained to minimize a hinge loss over the embeddings of pairs of sequences, defined below:<br />
<br />
<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ \max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center><br />
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center><br />
<br />
where <math>d(R^1,R^2)</math> measures the distance between the embeddings of the two sequences, chosen here to be the cosine distance, and <math> m </math> is a fixed scalar margin set to 0.5. Positive pairs are training examples where two sequences have the same final motion; negative pairs are training examples where two sequences have the exact opposite final motion. Using the L2 distance yields similar results to the cosine distance.<br />
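<br />
This loss can be sketched in a few lines of Python/NumPy (the authors used Torch; the function names here are our own illustrative choices):<br />
<br />
<pre>
import numpy as np

def cosine_distance(r1, r2):
    # d_cosine(R1, R2) = 1 - <R1, R2> / (||R1|| ||R2||)
    return 1.0 - np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))

def hinge_embedding_loss(r1, r2, positive, margin=0.5):
    # Positive pairs are pulled together; negative pairs are pushed
    # at least `margin` apart in cosine distance.
    d = cosine_distance(r1, r2)
    return d if positive else max(0.0, margin - d)
</pre>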
<br />
Each training sequence is recomposed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between (see the sketch below). Sequences of varying lengths are also used so that the representation generalizes across different temporal scales. Training the network with only one input image per time step was also tried, but consistently yielded worse results than image pairs.<br />
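<br />
A minimal sketch of this recomposition (assuming sequences of at least four frames and two intermediate frames per subsequence; the exact sampling scheme is our assumption, not the authors' implementation):<br />
<br />
<pre>
import random

def recompose(frames):
    # Recompose one training sequence into six subsequences: two
    # forward, two backward, and two identity loops. The two forward
    # subsequences share start and end frames but visit different
    # intermediate frames.
    n = len(frames)
    interior = range(1, n - 1)
    fwd1 = [0] + sorted(random.sample(interior, 2)) + [n - 1]
    fwd2 = [0] + sorted(random.sample(interior, 2)) + [n - 1]
    bwd1, bwd2 = fwd1[::-1], fwd2[::-1]   # same motions, reversed
    loop1 = fwd1 + bwd1[1:]               # out and back: net identity
    loop2 = fwd2 + bwd2[1:]
    return [[frames[i] for i in seq]
            for seq in (fwd1, fwd2, bwd1, bwd2, loop1, loop2)]
</pre>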
<br />
[[File:paper13_fig2.png|650px|center|]]<br />
<br />
Overall, training with image pairs resulted in lower error than training with single images, as demonstrated in the table below.<br />
<br />
<br />
[[File:table.png|700px|center|]]<br />
<br />
== Experimentation ==<br />
The network was trained on rotated and translated MNIST data as well as the KITTI dataset. <br />
* Implemented in Torch<br />
* Adam was used for optimization, with a decay schedule of 30 epochs and a learning rate chosen by random search<br />
* Batch sizes of 50-60 for MNIST and 25-30 for KITTI<br />
* Dilated convolutions with ReLU activations and batch normalization<br />
* Two LSTM cells per layer, with 256 hidden units each<br />
* Sequence lengths of 3-5 images<br />
* MNIST networks were tested with sequences of up to 12 images <br />
<br />
=== Rigid Motion in 2D ===<br />
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations<br />
* Visualized the representation using t-SNE<br />
** Clear clustering by translation and rotation but not object classes<br />
** Suggests the representation captures the motion properties in the dataset, but is independent of image contents<br />
* Visualized the image-conditioned saliency maps<br />
**In Figure 3, red represents positive gradients of the activation with respect to the input image, and blue represents negative gradients.<br />
**If we consider a saliency map as a first-order Taylor expansion, the map shows the relationship between each pixel and the representation.<br />
** The map is obtained by taking the derivative of the network output with respect to the input image<br />
** The areas with the largest gradients are the parts that contribute most to the output<br />
** The resulting saliency maps strongly resemble the spatiotemporal energy filters of classical motion processing<br />
** This suggests the network is learning the right motion structure<br />
<br />
[[File:paper13_fig3.png|700px|center|]]<br />
<br />
=== Real World Motion in 3D ===<br />
* Uses KITTI dataset collected on a car driving through roads in Germany<br />
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth<br />
** The result is compared against the self-supervised flow algorithm of Yu et al. (2016), after the output from the flow algorithm is downsampled, fed through PCA, and regressed against the camera motion<br />
** The data shows it does not perform as well as the flow-based method, but consistently better than chance (guessing the mean value)<br />
** Largest improvements are shown in X and Z translation, which also have the most variance in the data<br />
** Shows the method is able to capture dominant motion structure<br />
* Test performance on interpolation task<br />
** Check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math><br />
** This tests how sensitive the network is to deviations from natural motion<br />
** High errors (<math>\gg 1</math>) mean the network can distinguish between realistic and unrealistic motion<br />
** In order to do this, the distance between the embeddings of the frame sequences of the first and last frame <math>R([I_1,I_T])</math> and of the first, middle, and last frame <math>R([I_1, I_m, I_T])</math> is computed. This distance is compared with the distance when the middle frame of the second embedding is changed to a frame that is visually similar (inside sequence): <math>R([I_1, I_{IN}, I_T])</math> and one that is visually dissimilar (outside sequence): <math>R([I_1, I_{OUT}, I_T])</math>. The results are shown in Table 3. The embedding distance method is compared to the Euclidean distance which is defined as the mean pixel distance between the test frame and <math>{I_1,I_T}</math>, whichever is smaller. It can be seen from the results that the embedding distance of the true frame is significantly lower than other frames. This means that the embedding distance used in the network is more sensitive to any atypical motions of the scenes. <br />
* Visualized saliency maps<br />
** Highlights objects moving in the background, and motion of the car in the foreground<br />
** Suggests the method can be used for tracking as well<br />
<br />
[[File:paper13_tab2.png|700px|center|]]<br />
<br />
[[File:paper13_fig4.png|700px|center|]]<br />
<br />
[[File:paper13_fig5.png|700px|center|]]<br />
<br />
[[File:table3_motion.PNG|700px|center|]]<br />
<br />
* Figure 7 displays graphs comparing the mean squared error of the method presented in this paper to the baseline chance method and the supervised Flow PCA method.<br />
<br />
[[File:paper13_fig6.PNG|700px|center|]]<br />
<br />
== Conclusion ==<br />
The authors presented a new model of motion and a method for learning motion representations. It is shown that by enforcing group properties we can learn motion representations that are able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to a large variety of tasks.<br />
<br />
== Criticism ==<br />
Although this method does not require any labelled data, it still learns through supervision via the defined constraints. The idea of training on unlabelled data is interesting and has meaningful practical applications. Unfortunately, the authors did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The authors performed experiments on transformed MNIST data and KITTI data. The MNIST data was transformed by the authors, so the ground truth was readily available; however, the validity of the results was claimed only through the indirect means of t-SNE and saliency map visualization. For the KITTI dataset, the authors regressed the representations against ground truth to obtain a mapping from the network output to a physical motion representation. Again, the results were compared only indirectly against ground truth, and the method shows poor results relative to the Flow+PCA baseline, especially for X and Z translation as well as Y rotation, which are the main elements of motion present in the KITTI dataset. Such experimentation makes the method hardly convincing. In addition, the network does not output motion representations with physical meaning, which limits the usefulness of the proposed method in real-world applications.<br />
<br />
One of the motivations the authors use for this approach is that traditional SLAM formulations represent motion as a sequence of poses in <math> SE(3) </math>, and that they are unable to represent non-rigid or independent motions. There exist SLAM formulations that represent motion as [http://ieeexplore.ieee.org/document/7353368/ Gaussian processes], as well as [http://journals.sagepub.com/doi/abs/10.1177/0278364915585860 temporal basis functions], and it is quite [https://openslam.org/robotvision.html common] for inertial, monocular-camera SLAM problems to use a motion representation on <math> SIM(3) </math>, the group containing all scale-preserving transformations. A <math> SIM(3) </math> transformation is not, in general, rigid, so it is not true to say that modern SLAM is unable to represent non-rigid motions. Additionally, the saliency images from the KITTI experiment displaying network gradients on independently moving objects in the scene do not necessarily mean that the motion representation is capturing independent motion; they only mean that the network representation depends on those pixels. As the authors did not provide an error comparison between images containing independent motions and those without, it is possible that these network gradients only contribute to error (in terms of the camera pose) instead of capturing independent motions.<br />
<br />
Another criticism is that the group-properties constraint the authors impose is too weak. Any set of invertible functions that contains the identity and is closed under composition and inverses forms a group. While physical motions are one example of such a group, there are many valid groups that do not represent any coherent physical motion. That is, it is unclear whether group representations adequately describe the underlying mechanisms of the paper.<br />
<br />
Since the network has to learn both the group elements, <math>\overline{M}</math>, and the composition function, <math>\diamond</math>, associated with the group, it is difficult to tell how each of them is performing. It would not be possible to perform a layer-by-layer ablation study to determine the individual contribution of each learned function.<br />
<br />
Finally, the method requires domain knowledge of the motion space and feature engineering for encoding it, which reduces the ease with which the method can be generalized to various tasks.<br />
<br />
== References ==<br />
Jaegle, A. (2018). Understanding image motion with group representations. ICLR. Retrieved from https://openreview.net/pdf?id=SJLlmG-AZ.<br />
<br />
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ''International Conference on Learning Representations (ICLR) Workshop'', 2013.<br />
<br />
Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. ''European Conference on Computer Vision (ECCV) Workshops'', 2016.</div>
<hr />
<div>'''On The Convergence Of ADAM And Beyond'''<br />
<br />
= Introduction =<br />
Stochastic gradient descent (SGD) is currently the dominant method for training deep networks. Variants of SGD that scale the gradients using information from past gradients have been very successful, since the learning rate is adjusted on a per-feature basis, with ADAGRAD being one example. However, ADAGRAD's performance deteriorates when loss functions are nonconvex and gradients are dense. Several variants of ADAGRAD, such as RMSProp, ADAM, ADADELTA, and NADAM, have been proposed which address this issue by using exponential moving averages of squared past gradients, thereby limiting the update to rely on only the past few gradients. The following formulas illustrate the setup, starting from the per-parameter gradient:<br />
<math><br />
g_{t, i} = \nabla_\theta J( \theta_{t, i} ).<br />
</math><br />
<br />
The per-parameter update using SGD then becomes:<br />
<math><br />
\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}.<br />
</math><br />
<br />
ADAGRAD then scales each parameter's update using <math>G_t</math>, a diagonal matrix whose entries are the sums of the squared past gradients of the corresponding parameters:<br />
<math><br />
\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}.<br />
</math><br />
<br />
This paper focuses strictly on the pitfalls in convergence of the ADAM optimizer from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that it is possible for ADAM to get "stuck" in its weighted average history, preventing it from converging to an optimal solution. For example, in an experiment there may be a large spike in the gradient during some mini-batches, but since ADAM weighs the current update by the exponential moving average of squared past gradients, the effect of the large spike is lost. The authors' analysis suggests that this can be prevented through novel but simple adjustments to the ADAM optimization algorithm, which can improve convergence. This paper was published in ICLR 2018.<br />
<br />
== Notation ==<br />
The paper presents the following framework as a generalization to all training algorithms, allowing us to fully define any specific variant such as AMSGrad or SGD entirely within it:<br />
<br />
[[File:training_algo_framework.png|700px|center]]<br />
<br />
Here <math> x_t </math> denotes our network parameters, defined within a vector space <math> \mathcal{F} </math>, and <math> \prod_{\mathcal{F}} (y) </math> denotes the projection of <math> y </math> onto the set <math> \mathcal{F} </math>. It should be noted that the <math>\sqrt{V_t}</math> in the expression <math> \prod_{\mathcal{F}, \sqrt{V_t}}</math> appears to be a typo.<br />
<math> \phi_t </math> and <math> \psi_t </math> correspond to arbitrary functions we will specify later; the former maps from the history of gradients to <math> \mathbb{R}^d </math>, and the latter maps from the history of gradients to positive semi-definite matrices. Finally, <math> f_t </math> is our loss function at some time <math> t </math>; the rest should be fairly self-explanatory. Using this framework and defining different <math> \psi_t </math> and <math> \phi_t </math> allows us to recover all the different kinds of training algorithms under this one roof.<br />
<br />
=== SGD As An Example ===<br />
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It is easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.<br />
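<br />
As a concrete sketch (in Python/NumPy; the helper names are ours), SGD inside the framework looks like:<br />
<br />
<pre>
import numpy as np

def sgd_phi(grads):
    # phi_t returns only the most recent gradient
    return grads[-1]

def sgd_psi(grads):
    # psi_t is the identity, so V_t has no effect on the update
    return np.eye(grads[-1].shape[0])

def framework_step(x, grads, t, alpha):
    # One step of the generic framework instantiated as SGD (the
    # projection onto F is omitted). V_t is diagonal here, so its
    # elementwise square root equals its matrix square root.
    m_t, V_t = sgd_phi(grads), sgd_psi(grads)
    alpha_t = alpha / np.sqrt(t)
    return x - alpha_t * m_t / np.sqrt(np.diag(V_t))
</pre>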
<br />
=== ADAGRAD As Another Example ===<br />
<br />
To recover ADAGRAD, we select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = \frac{\sum_{i=1}^{t} g_i^2}{t} </math>, and <math>\alpha_t = \alpha / \sqrt{t}</math>. Therefore, compared to SGD, ADAGRAD uses a different step size for each parameter, based on the past gradients for that parameter; the learning rate becomes <math> \alpha_t = \alpha / \sqrt{\sum_i g_{i,j}^2} </math> for each parameter <math> j </math>. The authors note that this scheme is quite efficient when the gradients are sparse.<br />
<br />
=== ADAM As Another Example ===<br />
Once you can convince yourself that the recovery of SGD from the generalized framework is correct, you should understand the framework enough to see why the following setup for ADAM will allow us to recover the behaviour we want. ADAM has the ability to define a "learning rate" for every parameter based on how much that parameter moves over time (a.k.a its momentum) supposedly to help with the learning process.<br />
<br />
In order to do this, we choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=0}^{t} {\beta_1}^{t - i} g_i </math>, <math> \psi_t </math> to be <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=0}^{t} {\beta_2}^{t - i} {g_i}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. This setup corresponds to a per-parameter learning rate decay of <math>\alpha / \sqrt{\sum_i g_{i,j}^2}</math> for <math>j \in [d]</math>.<br />
<br />
From this, we can now see that <math>m_t </math> gets filled with the exponentially weighted average of the gradient history seen so far in the algorithm. As we proceed to update, we scale each parameter by dividing out <math> V_t </math> (in the diagonal case, just one over the corresponding diagonal entry), which contains the exponentially weighted average of each parameter's squared gradients (<math> {g_t}^2 </math>) across training so far. Thus each parameter has its own unique scaling by its second moment, or momentum. Intuitively, from a physical perspective, if each parameter is a ball rolling around in the optimization landscape, then instead of having the ball change position at a fixed velocity (i.e. momentum of 0), the ball can now accelerate or slow down depending on whether it is on a steep hill or a flat trough (i.e. a momentum that can change with time).<br />
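<br />
The same instantiation written as a recursive update (a Python/NumPy sketch; the epsilon term and bias correction of the original ADAM are omitted, as in the paper's analysis):<br />
<br />
<pre>
import numpy as np

def adam_step(x, m, v, g, t, alpha, beta1=0.9, beta2=0.999):
    # m is the output of phi_t: an exponential moving average of
    # gradients. v is the diagonal of psi_t's output: an exponential
    # moving average of squared gradients.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    alpha_t = alpha / np.sqrt(t)
    return x - alpha_t * m / np.sqrt(v), m, v
</pre>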
<br />
= <math> \Gamma_t </math>, an Interesting Quantity =<br />
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:<br />
<br />
<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center><br />
<br />
This quantity essentially measures the change of the "inverse of the learning rate" across time (since we are using the <math>\alpha_t</math> as step sizes). A key observation is that for SGD and ADAGRAD, <math>\Gamma_t \succeq 0</math> for all <math>t \in [T]</math>, which follows directly from their update rules. Looking back at our example of SGD, it is not hard to see that this quantity is positive semidefinite, which leads to "non-increasing" learning rates, a desired property. However, that is not the case with ADAM, which can pose a problem in both theoretical and applied settings: <math> \Gamma_t </math> can potentially be indefinite for <math>t \in [T]</math>, which the original convergence proof assumed was impossible. The math for this proof is VERY long, so instead we will opt for an example to showcase why this could be an issue.<br />
<br />
Consider the loss function <math> f_t(x) = \begin{cases} <br />
Cx & \text{for } t \text{ mod 3} = 1 \\<br />
-x & \text{otherwise}<br />
\end{cases} </math><br />
<br />
where we have <math> C > 2 </math> and <math> \mathcal{F} = [-1,1] </math>. Additionally, we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math>, and plug this into the framework from before. This function is periodic, and it is easy to see that it has a gradient of C once and then a gradient of -1 twice every period. It has the optimal solution <math> x = -1 </math> (from a regret standpoint), but using ADAM we would eventually converge at <math> x = 1 </math>, since <math> \psi_t </math> scales down the <math> C </math> by a factor of almost <math> C </math>, so that it is unable to "overpower" the multiple -1's.<br />
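<br />
A toy simulation of this example (a sketch; the learning rate and iteration count are arbitrary choices of ours) shows the drift toward the worst-case point:<br />
<br />
<pre>
import numpy as np

C, alpha = 4.0, 0.1
beta2 = 1.0 / (1.0 + C ** 2)       # beta1 = 0, as in the example
x, v = 0.0, 0.0
for t in range(1, 30001):
    g = C if t % 3 == 1 else -1.0  # gradient of f_t
    v = beta2 * v + (1 - beta2) * g ** 2
    x = x - (alpha / np.sqrt(t)) * g / np.sqrt(v)
    x = min(max(x, -1.0), 1.0)     # project back onto F = [-1, 1]
print(x)  # ends near +1 rather than the optimum -1
</pre>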
<br />
We formalize this intuition in the results below.<br />
<br />
'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
One might think that adding a small constant in the denominator of the update function can help avoid this issue by modifying the update for ADAM as follows:<br />
\begin{align}<br />
\hat x_{t+1} = x_t - \alpha_t m_t/\sqrt{V_t + \epsilon \mathbb{I}}<br />
\end{align}<br />
<br />
The selection of <math>\epsilon</math> appears to be crucial for the performance of the algorithm in practice. However, this work shows that for any constant <math>\epsilon > 0</math>, there exists an online optimization setting where ADAM has non-zero average regret asymptotically.<br />
<br />
'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
The theorem shows that the convergence of the algorithm to the optimal solution will not be improved by momentum or regularization via <math> \varepsilon </math> with constant <math> \beta_1 </math> and <math> \beta_2</math>.<br />
<br />
<br />
'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution. <br />
<br />
Kingma & Ba (2015) mentioned that the analysis of ADAM relies on decreasing <math> \beta_1 </math> over time. As <math> \beta_2 </math> is the critical parameter, the examples can easily be extended to the case where <math> \beta_1 </math> is decreasing over time. The paper only focuses on proving non-convergence of ADAM when <math> \beta_1 </math> is constant.<br />
<br />
= AMSGrad as an improvement to ADAM =<br />
There is a very simple, intuitive fix to ADAM that handles this problem: rather than dividing by the current exponentially weighted average of squared past gradients, we divide by the maximum such average seen so far, which guarantees the effective learning rate never increases. This amounts to a one-line adaptation of ADAM, giving AMSGrad:<br />
[[File:AMSGrad_algo.png|700px|center]]<br />
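<br />
Relative to the ADAM sketch above, the change is a single extra line (again a sketch omitting epsilon and bias correction):<br />
<br />
<pre>
import numpy as np

def amsgrad_step(x, m, v, v_hat, g, t, alpha, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)   # the one-line change
    # Dividing by the running maximum guarantees the effective
    # learning rate never increases, keeping Gamma_t PSD.
    return x - (alpha / np.sqrt(t)) * m / np.sqrt(v_hat), m, v, v_hat
</pre>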
<br />
Below are some simple plots comparing ADAM and AMSGrad; the first is from the paper and the second is from another individual who attempted to recreate the experiments. The two sets of plots somewhat disagree with one another, so take this heuristic improvement with a grain of salt.<br />
<br />
[[File:AMSGrad_vs_adam.png|900px|center]]<br />
<br />
Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge<br />
<br />
[[File:AMSGrad_vs_adam3.png|900px|center]]<br />
<br />
[[File:AMSGrad_vs_adam2.png|700px|center]]<br />
<br />
= Conclusion =<br />
The authors introduced a framework through which several different training algorithms can be viewed, and used it to recover SGD as well as ADAM. In their recovery of ADAM, they investigated the change of the inverse of the learning rate over time and discovered that in certain cases there were convergence issues. They proposed a new heuristic, AMSGrad, to deal with this problem and presented some empirical results that show it may help ADAM slightly.<br />
<br />
== Critique ==<br />
The contrived example which serves as the intuition to illustrate the failure of ADAM is not convincing, since we can construct similar failure examples for SGD as well. <br />
Consider the loss function <br />
<br />
<math> f_t(x) = \begin{cases} <br />
-x & \text{for } t \text{ mod 2} = 1 \\<br />
-\frac{1}{2} x^2 & \text{otherwise}<br />
\end{cases} <br />
</math><br />
<br />
where <math> x \in \mathcal{F} = [-a, 1], a \in [1, \sqrt{2}) </math>. The optimal solution is <math>x=1</math>, but starting from an initial point <math>x_{t=0} \le -1</math>, SGD will converge to <math>x = -a</math>.<br />
<br />
The authors also fail to explain why ADAM is popular in practice, and why it works better than other optimizers in certain situations.<br />
<br />
==Implementation == <br />
Keras implementation of AMSGrad : https://gist.github.com/kashif/3eddc3c90e23d84975451f43f6e917da<br />
<br />
= Source =<br />
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018</div>
<hr />
<div>'''Word translation without parallel data'''<br />
<br />
[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par with, or outperforms, supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to a target space, with a discriminator trained to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of their method on an example of a low-resource language pair where parallel corpora are not available (English-Esperanto), for which the method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Typically, however, these methods require initialization from prior knowledge, such as the edit distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between two embedding sets such that translations are close in the shared space. Before discussing the model used here, a model that exploits the similarities of monolingual embedding spaces should be introduced. Mikolov et al. (2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in\{1,\dots,n\}} </math> and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of <math>d \times d</math> matrices of real numbers, and X and Y are two aligned matrices of size <math>d \times n</math> containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm, i.e. the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F^2\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Schwarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
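<br />
The closed-form solution is only a few lines of NumPy (a sketch; the authors' released code is in PyTorch):<br />
<br />
<pre>
import numpy as np

def procrustes(X, Y):
    # Solve min_W ||WX - Y||_F over orthogonal W via the SVD of Y X^T.
    # X, Y: d x n matrices of aligned source/target embeddings.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt
</pre>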
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, the model learns an initial proxy of W using an adversarial criterion. Then, it uses the words that match best as anchor points for Procrustes. Finally, it improves performance on less frequent words by changing the metric of the space, which spreads out points lying in dense regions. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X=\{x_1,...,x_n\} </math> and <math> Y=\{y_1,...,y_m\} </math> be two sets of n and m word embeddings coming from a source and a target language, respectively. A model is trained to discriminate between elements randomly sampled from <math> WX=\{Wx_1,...,Wx_n\} </math> and Y; we call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims to maximize its ability to identify the origin of an embedding, and W aims to prevent the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al. (2016), who proposed to learn latent representations invariant to the input domain, where in this case a domain is represented by a language (source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math>.<br />
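<br />
A minimal PyTorch-style sketch of one such iteration (layer sizes follow the architecture described below; label smoothing, dropout, and the orthogonality update are omitted, and the helper name is ours):<br />
<br />
<pre>
import torch
import torch.nn as nn

d = 300
W = nn.Linear(d, d, bias=False)                      # the mapping W
D = nn.Sequential(nn.Linear(d, 2048), nn.LeakyReLU(),
                  nn.Linear(2048, 2048), nn.LeakyReLU(),
                  nn.Linear(2048, 1), nn.Sigmoid())  # discriminator
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

def train_step(x_batch, y_batch):
    # x_batch, y_batch: batches of source and target embeddings.
    labels = torch.cat([torch.ones(len(x_batch)), torch.zeros(len(y_batch))])
    # 1) Discriminator step: minimize L_D (mapped source -> 1, target -> 0)
    opt_D.zero_grad()
    preds = torch.cat([D(W(x_batch).detach()), D(y_batch)]).squeeze(1)
    bce(preds, labels).backward()
    opt_D.step()
    # 2) Mapping step: minimize L_W by fooling the discriminator
    # (only W's parameters are updated here)
    opt_W.zero_grad()
    preds = torch.cat([D(W(x_batch)), D(y_batch)]).squeeze(1)
    bce(preds, 1 - labels).backward()
    opt_W.step()
</pre>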
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper builds a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, the most frequent words are considered and only mutual nearest neighbors are retained, to ensure a high-quality dictionary. Subsequently, the Procrustes solution in (2) is applied to this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, only small improvements are observed when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y_t\in N_T(Wx_s)}\cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity which is the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2\cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
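<br />
A vectorized NumPy sketch of CSLS (assuming the rows of both inputs are L2-normalized, so dot products are cosine similarities; the function name is ours):<br />
<br />
<pre>
import numpy as np

def csls(WX, Y, k=10):
    # WX: n x d mapped source embeddings; Y: m x d target embeddings.
    sims = WX @ Y.T                                   # n x m cosines
    # r_T: mean similarity of each mapped source word to its k target NNs
    r_T = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # shape (n,)
    # r_S: mean similarity of each target word to its k source NNs
    r_S = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # shape (m,)
    return 2 * sims - r_T[:, None] - r_S[None, :]
</pre>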
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper uses unsupervised word vectors that were trained using fastText. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear fewer than 5 times are discarded for training. As a post-processing step, only the 200k most frequent words were selected in the experiments.<br />
For the discriminator, the authors use a multilayer perceptron with two hidden layers of size 2048 and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise at a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. Stochastic gradient descent is used with a batch size of 32, a learning rate of 0.1, and a decay of 0.95, both for the discriminator and W. <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as that of frequent words (Luong et al., 2013), and it was observed that feeding the discriminator rare words had a small but non-negligible negative impact. As a result, only the 50,000 most frequent words are fed to the discriminator. At each training step, the word embeddings given to the discriminator are sampled uniformly; sampling them according to word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or, equivalently, using the standard derivative of the trace function,<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F^2 = \nabla\,\text{Tr}\big((W^TW-I)^T(W^TW-I)\big)=4W(W^TW-I).<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
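<br />
In code, the update is a one-liner (a NumPy sketch):<br />
<br />
<pre>
import numpy as np

def orthogonalize(W, beta=0.01):
    # W <- (1 + beta) W - beta (W W^T) W, i.e. a small step down the
    # gradient of ||W^T W - I||_F^2, toward the orthogonal manifold.
    return (1 + beta) * W - beta * (W @ W.T) @ W
</pre>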
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
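<br />
A sketch of this mutual-nearest-neighbor filter, reusing the csls() helper sketched earlier (the function name is ours):<br />
<br />
<pre>
def build_dictionary(WX, Y, k=10):
    # Keep only pairs that are mutual nearest neighbors under CSLS.
    scores = csls(WX, Y, k)
    src2tgt = scores.argmax(axis=1)   # best target for each source word
    tgt2src = scores.argmax(axis=0)   # best source for each target word
    return [(s, t) for s, t in enumerate(src2tgt) if tgt2src[t] == s]
</pre>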
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper considers the 10k most frequent source words, uses CSLS to generate a translation for each of them, computes the average cosine similarity between these deemed translations, and uses this average as a validation metric. The choice of the 10k most frequent source words requires more justification, since we would expect those to be the best-trained words, which may not accurately represent the entire data set; perhaps a k-fold cross-validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage).<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work is shown in Table 2, where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that the model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embeddings. This figure displays English-to-English word alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Alignment is also good using a different embedding model, although CSLS consistently gives better results (b). Results are worse when a different corpus is used (c), and worse still when both the embedding model and the corpus differ (d).]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space, 2013<br />
| arXiv:1301.3781<br />
<br />
<br />
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. HLT-NAACL.</div>
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity which is the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space, 2013<br />
| arXiv:1301.3781</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36280Word translation without parallel data2018-04-07T15:11:22Z<p>Tdunn: /* Cross-Domain Similarity Local Scaling (CSLS) */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of our method using an example of a low-resource language pair where parallel corpora are not available (English-Esperanto) for which their method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Regularly however these methods need to have an initialization of prior knowledge, such as the editing distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm which is the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Swarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity which is the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space<br />
| arXiv:1301.3781</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36279Word translation without parallel data2018-04-07T15:04:18Z<p>Tdunn: /* Estimation of Word Representations in Vector Space */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of our method using an example of a low-resource language pair where parallel corpora are not available (English-Esperanto) for which their method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Regularly however these methods need to have an initialization of prior knowledge, such as the editing distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm which is the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Swarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space<br />
| arXiv:1301.3781</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36278Word translation without parallel data2018-04-07T14:55:09Z<p>Tdunn: /* Estimation of Word Representations in Vector Space */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of our method using an example of a low-resource language pair where parallel corpora are not available (English-Esperanto) for which their method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Regularly however these methods need to have an initialization of prior knowledge, such as the editing distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm which is simply the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Swarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space<br />
| arXiv:1301.3781</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36277Word translation without parallel data2018-04-07T14:49:46Z<p>Tdunn: /* Source */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that either is on par, or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of our method using an example of a low-resource language pair where parallel corpora are not available (English-Esperanto) for which their method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. Regularly however these methods need to have an initialization of prior knowledge, such as the editing distance between the input and output ground truth. This unfortunately only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before talking about the model it used, a model which can exploit the similarities of monolingual embedding spaces should be introduced. Mikolov et al.(2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in{1,n}} </math>. and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of d*d matrices of real numbers, and X and Y are two aligned matrices of size d*n containing the embeddings of the words in the parallel vocabulary. <br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem for which the goal is to find an orthogonal matrix that best maps two given matrices on the measure of the Frobenius norm. It advantageously offers a closed form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math> :<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T).<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\text{ invariance of trace under cyclic permutations}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\text{ Cauchy-Swarz inequality}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance over less frequent words by changing the metric of the space, which leads to spread more of those points in dense region. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X={x_1,...,x_n} </math> and <math> Y={y_1,...,y_m} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained is trained to discriminate between elements randomly sampled from <math> WX={Wx_1,...,Wx_n} </math> and Y, We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al.(2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language(source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to respectively minimize <math> L_D </math> and <math> L_W </math><br />
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, this paper consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, this paper apply the Procrustes solution in (2) on this generated dictionary. Considering the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, this paper only observe small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. <math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper use unsupervised word vectors that were trained using fastText2. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear less than 5 times are discarded for training. As a post-processing step, only the first 200k most frequent words were selected in the experiments.<br />
For the discriminator, it use a multilayer perceptron with two hidden layers of size 2048, and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. This paper use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 both for the discriminator and W . <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as the one of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible negative impact. As a result, this paper only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F = \nabla\text{Tr}(W^TW-I)(W^TW-I)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I)\text{ (check derivative of trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
amounts to a step in the direction opposite the gradient of f. That is, a step toward the set of orthogonal matrices.<br />
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper consider the 10k most frequent source words, and use CSLS to generate a translation for each of them, then compute the average cosine similarity between these deemed translations, and use this average as a validation metric. The choice of using the 10 thousand most frequent source words is requires more justification since we would expect those to be the best trained words and may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage)<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work in shown in Table 2 where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their ini- tial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by- word translation is enough to get a rough idea of the input sentence. The quality of the gener- ated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure which displays English to English world alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Also good alignment using different model and corpora, although CSLS consistently has better results (b). Worse results due to use of different corpora (c). Even worse results when both embedding model and corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major downfall of this method when it actually comes to translation is the restriction that the two languages must have similar intrinsic structures to allow for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised techniques, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space<br />
| arXiv:1301.3781</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36185One-Shot Imitation Learning2018-04-05T20:10:32Z<p>Tdunn: /* Problem Formalization */</p>
<hr />
<div>= Introduction =<br />
Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning:<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations.<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36184One-Shot Imitation Learning2018-04-05T20:07:14Z<p>Tdunn: /* Block Stacking Tasks */</p>
<hr />
<div>= Introduction =<br />
Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning:<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^T \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations.<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36183One-Shot Imitation Learning2018-04-05T20:06:53Z<p>Tdunn: /* Block Stacking Tasks */</p>
<hr />
<div>= Introduction =<br />
Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning:<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^T \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm. In this case, the reward function takes on binary values that indicate weather or not the task was completed correctly.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations.<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 and 10 blocks and a different desired final layout. Over 1000 demonstrations were collected for each task, using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (i.e. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate, and performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors attribute this to two factors: the final state alone lacks information about how the task was performed, and temporal dropout acts as a regularizer by augmenting the data when conditioning on the full demonstration, an effect that is absent when conditioning only on the snapshots or the final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise was injected during training to improve performance; in practice, additional noise can still be injected at test time, though some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to better understand how it operates. There are two kinds of attention that the authors are mainly interested in: one where the policy attends to different time steps in the demonstration, and another where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
The paper shows an impressive result, the ability to learn a new task from just a single demonstration, but there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and whether the proposed model could still achieve comparable performance if a more successful demonstration, perhaps one by a human user, were provided. Given the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration that achieves the correct final state but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking. Although each block stacking task is slightly different, the differences may turn out to be insignificant compared to the other tasks this model should handle if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for this task. The 'one-shot' framing is also somewhat misleading: many demonstrations across many training tasks are needed to first obtain a generic policy that can generalize, and only then is a single demonstration of a new task expected to suffice, so a form of pre-training is involved. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks, for which there is little training data and no reward function available, to be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of one-shot imitation learning is the particle reaching problem, which provides a relatively simple suite of tasks, an arbitrary one of which the network must solve. In each task there is an agent that moves according to a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task); the goal is to move the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
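<br />
A toy sketch of this setup is given below; the point-mass dynamics and success tolerance are inventions for illustration, used only to make the task description concrete.<br />
<pre>
import numpy as np

class ParticleReachingEnv:
    # Toy particle-reaching task: a point-mass agent driven by a 2D force,
    # with n landmarks at random 2D locations.
    def __init__(self, n_landmarks, dt=0.1):
        self.landmarks = np.random.uniform(-1, 1, size=(n_landmarks, 2))
        self.pos = np.zeros(2)
        self.vel = np.zeros(2)
        self.dt = dt

    def step(self, force):
        # Simple point-mass dynamics driven by the 2D force vector.
        self.vel += self.dt * np.asarray(force)
        self.pos += self.dt * self.vel
        return self.pos.copy()

    def success(self, target_idx, tol=0.05):
        # The task succeeds when the agent is within tol of the
        # landmark reached in the demonstration.
        return np.linalg.norm(self.pos - self.landmarks[target_idx]) < tol
</pre>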
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from comparing different network architectures on this problem. The three architectures compared (described below) are a plain LSTM, an LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, and the best generalization performance is achieved by the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the demonstration trajectory as input (the position of the agent changes over time, approaching one of the targets). The LSTM output, together with the current state of the task to be solved, is the input to a multi-layer perceptron (MLP) that produces the solution; a hedged sketch of this baseline follows the list.<br />
*LSTM with attention: the LSTM output is now a set of weights over the different targets during training. These weights and the test state are used in the test task, and the resulting 2D output is the input to an MLP as before.<br />
*Final state with attention: looks only at the final state of the demonstration, since the final state sufficiently identifies which target to reach (the trajectory is not required). As in the previous architecture, it produces weights that are used by the MLP.<br />
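<br />
The sketch of the plain LSTM baseline referenced above; apart from the 512 hidden units, all dimensions are placeholders.<br />
<pre>
import torch
import torch.nn as nn

class PlainLSTMPolicy(nn.Module):
    # Hedged sketch: an LSTM reads the demonstration trajectory, and its
    # final output, concatenated with the current state, feeds an MLP.
    def __init__(self, obs_dim, state_dim, action_dim, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, demo, state):
        # demo: (1, T, obs_dim) trajectory; state: (1, state_dim)
        out, _ = self.lstm(demo)
        summary = out[:, -1, :]  # last LSTM output summarizes the demo
        return self.mlp(torch.cat([summary, state], dim=1))
</pre>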
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." arXiv preprint arXiv:1703.07326 (2017). (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>Tdunn
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35903Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-29T20:53:54Z<p>Tdunn: /* Coarse Level Features Needed For Classification */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
efficient networks, but don't do well on hard examples, or large networks that do well on all examples but require a large amount of resources.<br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.<br />
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolution networks that are computationally efficient at test time focus on reducing model size after training. Many existing methods for refining an accurate network to be more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains smaller student networks to reproduce the output of a much larger teacher network. The proposed work differs from these approaches as it trains a single model which trades computation efficiency for accuracy at test time without re-training or finetuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.<br />
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_test = {x_1, . . . , x_M}</math> within a finite computational budget <math>B > 0</math> that is known in advance.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse level feature refers to a set of filters in a CNN with low resolution. There are several ways to create such features. These methods are typically refereed to as down sampling. Some example of layers that perform this function are: max pooling, average pooling and convolution with strides. In this architecture, convolution with strides will be used to create coarse features. <br />
<br />
Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically with the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on Cifar-100 dataset and it can be seen that the intermediate classifiers perform worst than the final classifiers, thus highlighting the problem with the lack of coarse level features early on.<br />
<br />
To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The feature maps at a particular layer and scale are computed by concatenating results from up to two convolutions: a standard convolution is first applied to same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and if possible, a strided convolution is also applied on the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special, mini-CNN-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
<br />
Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale. <br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.<br />
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping the <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet. <br />
<br />
When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases. The authors attributed this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in early layers of ResNets or DenseNets. Ensemble networks need to repeat computations of similar low-level features repeatedly when new models need to be evaluated, so their accuracy results do not increase as fast when computational budget increases. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on multi-scale feature maps, dense connectivity, and intermediate classifiers. This experiment started with an MSDNet with six intermediate classifiers and each of these components were removed, one at a time. To make our comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all the three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., Maaten, L., & Weinberger, K. Q. (n.d.). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. doi:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space&diff=35891PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space2018-03-29T12:54:07Z<p>Tdunn: /* Point Set Classification in Euclidean Metric Space */</p>
<hr />
<div>= Introduction =<br />
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.<br />
<br />
[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]<br />
<br />
<br />
Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:<br />
<br />
# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.<br />
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.<br />
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points. <br />
<br />
Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud. <br />
<br />
[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]<br />
<br />
= Review of PointNet =<br />
<br />
The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.<br />
<br />
The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.<br />
<br />
[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]<br />
<br />
= PointNet++ =<br />
<br />
The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.<br />
<br />
== Problem Statement ==<br />
<br />
There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn functions <math>f</math> that take <math>X</math> as the input and produce information of semantic interest about it. In practice, <math>f</math> can often be a classification function that outputs a class label or a segmentation function that outputs a per point label for each member of <math>M</math>.<br />
<br />
== Method ==<br />
<br />
=== High Level Overview ===<br />
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]<br />
<br />
The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,<br />
<br />
\begin{aligned}<br />
\text{Input at each level: } N \times (d + c) \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and<br />
<br />
\begin{aligned}<br />
\text{Output at each level: } N' \times (d + c') \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.<br />
<br />
<br />
Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.<br />
<br />
=== Sampling Layer ===<br />
<br />
The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.<br />
<br />
To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.<br />
<br />
=== Grouping Layer ===<br />
<br />
The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.<br />
<br />
Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.<br />
<br />
To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.<br />
<br />
=== PointNet Layer ===<br />
<br />
After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.<br />
<br />
=== Robust Feature Learning under Non-Uniform Sampling Density ===<br />
<br />
The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.<br />
<br />
The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated to form a multi-scale feature. To train the network to learn an optimal strategy for combining the multi-scale features, the authors proposed random input dropout, which involves randomly dropping input points with a random probability for each training point set. Each input point has a dropout probability <math>\theta</math>. The authors used a <math>\theta</math> value of 0.95. As shown in the experiments section below, dropout provides robustness to input point density variations. During testing stage all points are used. MSG, however, is computationally expensive because for each region it always applies PointNet at large scale neighborhoods to all points. <br />
<br />
On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, features of a region from a certain level is a concatenation of two vectors. The left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points, since the first vector is based on subregions that would be even more sparse and suffer from sampling deficiency. On the other hand, when the density of a local region is high, the first vector can be weighted more heavily as it allows for inspecting at higher resolutions in the lower levels to obtain finer details. <br />
<br />
[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]<br />
<br />
== Point Cloud Segmentation ==<br />
<br />
If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.<br />
<br />
=== Distance-based Interpolation ===<br />
<br />
Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.<br />
<br />
To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.<br />
<br />
[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]<br />
<br />
=== Skip-connections ===<br />
<br />
In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.<br />
<br />
== Experiments ==<br />
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.<br />
<br />
=== Point Set Classification in Euclidean Metric Space ===<br />
<br />
The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.<br />
<br />
[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]<br />
<br />
In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below. The last row in the table below, "Ours (with normal)" used face normals (normal is the same for the entire face, regardless of the point picked on that face) as additional point features as well as additional points <math>(N = 5000)</math> to boost performance. All these points are normalized to have zero mean and be within one unit ball. The network contains three hierarchical levels with three fully connected layers.<br />
<br />
[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]<br />
<br />
An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points were reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points. This is not surprising because the dropout feature of PointNet++ ensures that the model is trained specifically to be robust to loss of points. <br />
<br />
[[File:paper28_fig4_chair.png | 300px|thumb|center|An example showing the reduction of points visually. At 256 points, the points making up the object is very spare, however the accuracy is only reduced by 1%]][[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]<br />
<br />
=== Semantic Scene Labelling ===<br />
<br />
The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.<br />
<br />
[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]<br />
<br />
To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.<br />
<br />
[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]<br />
<br />
To test how the trained model performed on scans with non-uniform sampling density, virtual scans of Scannet scenes were synthesized and the network was evaluated on this data. It can be seen from the above figures that SSG performance greatly falls due to the sampling density shift. MRG network, on the other hand, is more robust to the sampling density shift since it is able to automatically switch to features depicting coarser granularity when the sampling is sparse. This proves the effectiveness of the proposed density adaptive layer design.<br />
<br />
=== Classification in Non-Euclidean Metric Space ===<br />
<br />
[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]<br />
<br />
Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.<br />
<br />
[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]<br />
<br />
=== Feature Visualization ===<br />
The figure below visualizes what is learned by just the first layer kernels of the network. The model is trained on a dataset the mostly consisted of furniture which explains the lines, corners, and planes visible in the visualization. Visualization is performed by creating a voxel grid in space and only aggregating point sets that activate specific neurons the most.<br />
<br />
[[File:26_8.PNG | 800px|thumb|center|Pointclouds learned from first layer kernels (red is near, blue is far)]]<br />
<br />
=== Time and Space Complexity ===<br />
To evaluate the time and space complexity between PointNet++ and PointNet the authors recorded the model size and inference time for several point set deep learning approaches. Inference time is measured with a GTX 1080 and batch size 8. The PointNet inference times are significantly better however the model sizes for PointNet++ are comparable. The table below outlines the specifics for each method.<br />
<br />
[[File:pointnet_complexity.PNG | 700px|thumb|center|Comparison of model size and inference time between PointNet and PointNet++]]<br />
<br />
== Critique ==<br />
<br />
It seems clear that PointNet is lacking capturing local context between points. PointNet++ seems to be an important extension, but the improvements in the experimental results seem small. Some computational efficiency experiments would have been nice. For example, the processing speed of the network, and the computational efficiency of MRG over MRG.<br />
<br />
== Code ==<br />
<br />
Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2 <br />
<br />
<br />
=Sources=<br />
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017<br />
<br />
2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors&diff=35890Robust Imitation of Diverse Behaviors2018-03-29T12:14:03Z<p>Tdunn: /* Motivation */</p>
<hr />
<div>=Introduction=<br />
One of the longest standing challenges in AI is building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, the authors combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.<br />
<br />
=Motivation=<br />
Deep generative models have recently shown great promise in imitation learning. The authors primarily talk about two approaches viz. supervised approaches that condition on demonstrations and Generative Adversarial Imitation Learning (GAIL). The authors also talk about limitations of the two approaches and try to combine the two approaches in order to address these limitations. Some of these limitations are as follows:<br />
<br />
* Supervised approaches that condition on demonstrations (VAE):<br />
** They require large training datasets in order to work for non-trivial tasks<br />
** They tend to be brittle and fail when the agent diverges too much from the demonstration trajectories (As proof of this brittleness, the authors cite Ross et al. (2010), who provide a theorem showing that the cost incurred by this kind of model when it deviates from a demonstration trajectory with a small probability can be amplified in a manner quadratic in the number of time steps. )<br />
* Generative Adversarial Imitation Learning (GAIL)<br />
** Adversarial training leads to mode-collapse (the tendency of adversarial generative models to cover only a subset of modes of a probability distribution, resulting in a failure to produce adequately diverse samples)<br />
** More difficult and slow to train as they do not immediately provide a latent representation of the data<br />
<br />
Thus, the former approach can model diverse behaviors without dropping modes but does not learn robust policies, while the latter approach gives robust policies but insufficiently diverse behaviors. Thus, the authors combine the favorable aspects of these two approaches. The base of their model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. Leveraging these policy representations, they develop a new version of GAIL that <br />
# is much more robust than the purely-supervised controller, especially with few demonstrations, and <br />
# avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.<br />
<br />
=Model=<br />
The authors first introduce a variational autoencoder (VAE) for supervised imitation, consisting of a bi-directional LSTM encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state while modeling correlations among states with a WaveNet.<br />
<br />
[[File: Model_Architecture.png|700px|center|]]<br />
<br />
==Behavioral cloning with VAE suited for control==<br />
<br />
In this section, the authors follow a similar approach to Duan et al. (2017), but opt for stochastic VAEs as having a distribution <math display="inline">q_\phi(z|x_{1:T})</math> to better regularize the latent space. In their VAE, an encoder stochastically maps a demonstration sequence to an embedding vector <math display="inline">z</math>. Given <math display="inline">z</math>, they decode both the state and action trajectories as shown in the figure above. To train the model, the following loss is minimized:<br />
<br />
\begin{align}<br />
L\left( \alpha, w, \phi; \tau_i \right) = - \pmb{\mathbb{E}}_{q_{\phi}(z|x_{1:T_i}^i)} \left[ \sum_{t=1}^{T_i} log \pi_\alpha \left( a_t^i|x_t^i, z \right) + log p_w \left( x_{t+1}^i|x_t^i, z\right) \right] +D_{KL}\left( q_\phi(z|x_{1:T_i}^i)||p(z) \right)<br />
\end{align}<br />
<br />
Where <math> \alpha </math> parameterizes the action decoder, <math> w </math> parameterizes the state decoder, <math> \phi </math> parameterizes the state encoder, and <math> T_i \in \tau_i </math> is the set of demonstration trajectories.<br />
<br />
The encoder <math display="inline">q</math> uses a bi-directional LSTM. To produce the final embedding, it calculates the average of all the outputs of the second layer of this LSTM before applying a final linear transformation to<br />
generate the mean and standard deviation of a Gaussian. Then, one sample from this Gaussian is taken as the demonstration encoding.<br />
<br />
The action decoder is an MLP that maps the concatenation of the state and the embedding of the parameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model. In particular, it conditions on the embedding <math display="inline">z</math> and previous state <math display="inline">x_{t-1}</math> to generate the vector <math display="inline">x_t</math> autoregressively. That is, the autoregression is over the components of the vector <math display="inline">x_t</math>. Finally, instead of a Softmax, the model uses a mixture of Gaussians as the output of the WaveNet.<br />
<br />
==Diverse generative adversarial imitiation learning==<br />
To enable GAIL to produce diverse solutions, the authors condition the discriminator on the embeddings generated by the VAE encoder and integrate out the GAIL objective with respect to the variational posterior <math display="inline">q_\phi(z|x_{1:T})</math>. Specifically, the authors train the discriminator by optimizing the following objective:<br />
<br />
\begin{align}<br />
{max}_{\psi} \pmb{\mathbb{E}}_{\tau_i \sim \pi_E} \left( \pmb{\mathbb{E}}_{q(z|x_{1:T_i}^i)} \left[\frac{1}{T_i} \sum_{t=1}^{T_i} logD_{\psi} \left( x_t^i, a_t^i | z \right) + \pmb{\mathbb{E}}_{\pi_\theta} \left[ log(1 - D_\psi(x, a | z)) \right] \right] \right)<br />
\end{align}<br />
<br />
The authors condition on unlabeled trajectories, which have been passed through a powerful encoder, and hence this approach is capable of one-shot imitation learning. Moreover, the VAE encoder enables to obtain a continuous latent embedding space where interpolation is possible.<br />
<br />
To better motivate the objective, the authors propose on temporarily leaving the context of imitation learning and considering an alternative objective for training GANs<br />
<br />
\begin{align}<br />
{min}_{G}{max}_{D} V (G, D) = \int_{y} p(y) \int_{z} q(z|y) \left[ log D(y | z) + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dy dz<br />
\end{align}<br />
<br />
This function is a simplification of the previous objective function. Furthermore, it satisfies the following property.<br />
<br />
===Lemma 1===<br />
Assuming that <math display="inline">q</math> computes the true posterior distribution that is <math display="inline">q(z|y) = \frac{p(y|z)p(z)}{p(y)}</math> then<br />
<br />
\begin{align}<br />
V (G, D) = \int_{z} p(z) \left[ \int_{y} p(y|z) log D(y|z) dy + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dz<br />
\end{align}<br />
<br />
If an optimal discriminator is further assumed, the cost optimized by the generator then becomes<br />
<br />
\begin{align}<br />
C(G) = 2 \int_ p p(z) JSD[p(\cdot|z) || G(\cdot|z)] dz - log4<br />
\end{align}<br />
<br />
where <math display="inline">JSD</math> stands for the Jensen-Shannon divergence. In the context of the WaveNet described in the earlier section, <math>p(x)</math> is the distribution of a mixture of Gaussians, and <math>p(z)</math> is the distribution over the mixture components, so the conditional distribution over the latent <math>z</math>, <math>p(x | z)</math> is uni-modal, and optimizing the divergence will not lead to mode collapse.<br />
<br />
==Policy Optimization Strategy: TRPO==<br />
<br />
[[file:robust_behaviour_alg.png | 800px]]<br />
<br />
In Algorithm 1, it states that TRPO is used for policy parameter updates. TRPO is short for Trust Region Policy Optimization, which an iterative procedure for policy optimization, developed by John Schulman, Sergey Levine, Philip Moritz, Micheal Jordan and Pieter Abbeel. This optimization methods achieves monotonic improve in fields related to robotic motions, such as walking and swimming. For more details on TRPO, please refer to the [https://arxiv.org/pdf/1502.05477.pdf original paper].<br />
<br />
=Experiments=<br />
<br />
The primary focus of the paper's experimental evaluation is to demonstrate that the architecture allows learning of robust controllers capable of producing the full spectrum of demonstration behaviors for a diverse range of challenging control problems. The authors consider three bodies: a 9 DoF robotic arm, a 9 DoF planar walker, and a 62 DoF complex humanoid (56-actuated joint angles, and a freely translating and rotating 3d root joint). While for the reaching task BC is sufficient to obtain a working controller, for the other two problems the full learning procedure is critical.<br />
<br />
The authors analyze the resulting embedding spaces and demonstrate that they exhibit a rich and sensible structure that can be exploited for control. Finally, the authors show that the encoder can be used to capture the gist of novel demonstration trajectories which can then be reproduced by the controller.<br />
<br />
==Robotic arm reaching==<br />
In this experiment, the authors demonstrate the effectiveness of their VAE architecture and investigate the nature of the learned embedding space on a reaching task with a simulated Jaco arm.<br />
<br />
To obtain demonstrations, the authors trained 60 independent policies to reach to random target locations in the workspace starting from the same initial configuration. 30 trajectories from each of<br />
the first 50 policies were generated. These served as training data for the VAE model (1500 training trajectories in total). The remaining 10 policies were used to generate test data.<br />
<br />
Here are the trajectories produced by the VAE model.<br />
<br />
[[File: Robotic_arm_reaching_VAE.png|300px|center|]]<br />
<br />
The reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust. After training, the VAE encodes and reproduces the demonstrations as shown in the figure below.<br />
<br />
[[File: Robotic_arm_reaching.png|650px|center|]]<br />
<br />
==2D Walker==<br />
<br />
As a more challenging test compared to the reaching task, the authors consider bipedal locomotion. Here, the authors train 60 neural network policies for a 2d walker to serve as demonstrations. These policies are each trained to move at different speeds both forward and backward depending on a label provided as additional input to the policy. Target speeds for training were chosen from a set of four different speeds (m/s): -1, 0, 1, 3.<br />
<br />
[[File: 2D_Walker.png|650px|center|]]<br />
<br />
The authors trained their model with 20 episodes per policy (1200 demonstration trajectories in total, each with a length of 400 steps or 10s of simulated time). In this experiment a full approach is required: training the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general, presumably because the relatively small training set does not cover the space of trajectories sufficiently densely. On this generated dataset, they also train policies with GAIL using the same architecture and hyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherently trajectories. Instead, it simply meshes different behaviors together. In addition, the policies trained with GAIL also, exhibit dramatically less diversity; see [https://www.youtube.com/watch?v=kIguLQ4OwuM video].<br />
<br />
[[File: 2D_Walker_Optimized.gif|frame|center|In the left panel, the planar walker demonstrates a particular walking style. In the right panel, the model's agent imitates this walking style using a single policy network.]]<br />
<br />
==Complex humanoid==<br />
For this experiment, the authors consider a humanoid body of high dimensionality that poses a hard control problem. They generate training trajectories with the existing controllers, which can produce instances of one of six different movement styles. Examples of such trajectories are shown in Fig. 5.<br />
<br />
[[File: Complex_humanoid.png|650px|center|]]<br />
<br />
The training set consists of 250 random trajectories from 6 different neural network controllers that were trained to match 6 different movement styles from the CMU motion capture database. Each trajectory is 334 steps or 10s long. The authors use a second set of 5 controllers from which they generate trajectories for evaluation (3 of these policies were trained on the same movement styles as the policies used for generating training data).<br />
<br />
Surprisingly, despite the complexity of the body, supervised learning is quite effective at producing sensible controllers: The VAE policy is reasonably good at imitating the demonstration trajectories, although it lacks the robustness to be practically useful. Adversarial training dramatically improves the stability of the controller. The authors analyze the improvement quantitatively by computing the percentage of the humanoid falling down before the end of an episode while imitating either training or test policies. The results are summarized in Figure 5 right. The figure further shows sequences of frames of representative demonstration and associated imitation trajectories. Videos of demonstration and imitation behaviors can be found in the [https://www.youtube.com/watch?v=NaohsyUxpxw video].<br />
<br />
[[File: Complex_humanoid_optimized.gif|frame|center|In the left and middle panels we show two demonstrated behaviors. In the right panel, the model's agent produces an unseen transition between those behaviors.]]<br />
<br />
Also, for practical purposes, it is desirable to allow the controller to transition from one behavior to another. The authors test this possibility in an experiment similar to the one for the Jaco arm: They determine the<br />
embedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on the first embedding vector, and then transition from one behavior to the other half-way through the episode by linearly interpolating the embeddings of the two demonstration trajectories over a window of 20 control steps. Although not always successful the learned controller often transitions robustly, despite not having been trained to do so. An example of this can be seen in the gif above. More examples can be seen in this [https://www.youtube.com/watch?v=VBrIll0B24o video].<br />
<br />
=Conclusions=<br />
The authors have proposed an approach for imitation learning that combines the favorable properties of techniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a model that from a moderate number of demonstration tragectories, can learn:<br />
# a semantically well structured embedding of behaviors, <br />
# a corresponding multi-task controller that allows to robustly execute diverse behaviors from this embedding space, as well as <br />
# an encoder that can map new trajectories into the embedding space and hence allows for one-shot imitation. <br />
The experimental results demonstrate that this approach can work on a variety of control problems and that it scales even to very challenging ones such as the control of a simulated humanoid with a large number of degrees of freedoms.<br />
<br />
=Critique=<br />
The paper proposes a deep-learning-based approach to imitation learning which is sample-efficient and is able to imitate many diverse behaviors. The architecture can be seen as conditional generative adversarial imitation learning (GAIL). The conditioning vector is an embedding of a demonstrated trajectory, provided by a variational autoencoder. This results in one-shot imitation learning: at test time, a new demonstration can be embedded and provided as a conditioning vector to the imitation policy. The authors evaluate the method on several simulated motor control tasks.<br />
<br />
Pros:<br />
* Addresses a challenging problem of learning complex dynamics controllers / control policies<br />
* Well-written introduction / motivation<br />
* The proposed approach is able to learn complex and diverse behaviors and outperforms both the VAE alone (quantitatively) and GAIL alone (qualitatively).<br />
* Appealing qualitative results on the three evaluation problems. Interesting experiments with motion transitioning. <br />
<br />
Cons:<br />
* Comparisons to baselines could be more detailed.<br />
* Many key details are omitted (either on purpose, placed in the appendix, or simply absent, like the lack of definitions of terms in the modeling section, details of the planner model, simulation process, or the details of experimental settings)<br />
* Experimental evaluation is largely subjective (videos of robotic arm/biped/3D human motion)<br />
* A discussion of sample efficiency compared to GAIL and VAE would be interesting.<br />
* The presentation is not always clear, in particular, I had a hard time figuring out the notation in Section 3.<br />
* There has been some work on hybrids of VAEs and GANs, which seem worth mentioning when generative models are discussed, like:<br />
# Autoencoding beyond pixels using a learned similarity metric, Larsen et al., ICML 2016<br />
# Generating Images with Perceptual Similarity Metrics based on Deep Networks, Dosovitskiy&Brox. NIPS 2016<br />
These works share the intuition that good coverage of VAEs can be combined with sharp results generated by GANs.<br />
* Some more extensive analysis of the approach would be interesting. How sensitive is it to hyperparameters? How important is it to use VAE, not usual AE or supervised learning? How difficult will it be for others to apply it to new tasks?<br />
<br />
=References=<br />
# Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., & Zaremba, W. (2017). One-shot imitation learning. Preprint arXiv:1703.07326.<br />
# Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.<br />
# Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).<br />
# Producing flexible behaviours in simulated environments. (n.d.). Retrieved March 25, 2018, from https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/<br />
# Cmu humanoid. (2017, May 19). Retrieved March 25, 2018, from https://www.youtube.com/watch?v=NaohsyUxpxw<br />
# Cmu transitions. (2017, May 19). Retrieved March 25, 2018, from https://www.youtube.com/watch?v=VBrIll0B24o</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35889stat946w18/Tensorized LSTMs2018-03-29T12:06:56Z<p>Tdunn: /* Introduction */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, usually the LSTM model introduces additional parameters, while LSTM with additional layers and wider layers increases the time required for model training and evaluation. As an alternative, the paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the Tensorized LSTM in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x1: t = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNN learns how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from to initial time-step to final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state ht is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^(R+M)xM </math> guarantees each hidden status provided by the previous step is of dimension M. <math> a_t ∈R^M </math> the hidden activation, and φ(·) the element-wise "tanh" function. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function, notes that the "Phi" is the element-wise function which produces some non-linearity and further generates another '''hidden status''', while the "Curly Phi" is applied to generates the '''output'''<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of RNN is the vanishing/exploding gradients. This shortfall is more significant especially when constructing long-range dependencies models. One alternative is to apply LSTM (Long Short-Term Memories), LSTMs alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Since LSTM is successfully in sequence models, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
Go through the methodology.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the original tensor, normalized tensor, gain parameter and bias parameter. The <math>m^z</math>-th channel of <math>\mathbf{Z}</math> is normalized element-wisely: <br />
<br />
\begin{align}<br />
\hat{z_{m^z}} = (z_{m^z} - z^\mu)/z^{\sigma} \hspace{2cm} (21)<br />
\end{align}<br />
<br />
where <math>z^{\mu}</math>, <math>z^{\sigma} \in \mathbb{R}^P</math> are the mean and standard deviation along the channel dimension of <math>\mathbf{Z}</math>, and <math>\hat{z_{m^z}} \in \mathbb{R}^P</math> is the <math>m^z</math>-th channel <math>\mathbf{\hat{Z}}</math>. Channel normalization introduces very few additional parameters compared to the number of other parameters in the model.<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of list of models tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundaments:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. Can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.<br />
<br />
All configurations are evaluated with depths L = 1, 2, 3, 4. Bits-per-character(BPC) is used to measure the model performance and the results are shown in the figure below.<br />
[[File:wiki.png |280px|center||Figure 5: WifiPerf]]<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
In both tasks, all configurations are evaluated with M = 100 and L= 1, 3, 5. The model performance is measured by the classification accuracy and results are shown in the figure below.<br />
<br />
[[File:MNISTperf.png |480px|center]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. Then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=35850Wavelet Pooling CNN2018-03-28T13:28:14Z<p>Tdunn: /* Discrete Wavelet Transform General */</p>
<hr />
<div>== Introduction ==<br />
Convolutional neural networks (CNN) have been proven to be powerful in image classification. Over the past few years, researchers have put efforts in improving fundamental components of CNNs such as the pooling operation. Various pooling methods exist; deterministic methods include max pooling and average pooling and probabilistic methods include mixed pooling and stochastic pooling. All these methods employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation uses a sub-band method that the authors' claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data is reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. Max pooling and Mean/Average pooling are the 2 most commonly used pooling methods. For max pooling, this can be represented by the equation <math>a_{kij} = max_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^th</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
To account for the above-mentioned issues, probabilistic pooling methods were introduced, namely mixed pooling and; stochastic pooling. Mixed pooling is a simple method which just combines the max and the average pooling by randomly selecting one method over the other during training. Mixed pooling can be applied for all features, mixed between features, or mixed between regions for different features. Stochastic pooling on the other hand randomly samples within a receptive field with the activation values as the probabilities. These are calculated by taking each activation value and dividing it by the sum of all activation values in the grid so that the probabilities sum to 1.<br />
<br />
Figure 3 shows an example of how stochastic pooling works. On the left is a 3x3 grid filled with activations. The middle grid is the corresponding probability for each activation. The activation in the middle was randomly selected (it had a 13% chance of getting selected). Because the stochastic pooling is based on the probability of the pixels, it is able to avoid the shortcomings of max and mean pooling mentioned above.<br />
<br />
[[File:paper21-stochasticpooling.png|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast-changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well. <br />
<br />
Essentially, a wavelet is a fast decaying oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well-defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transforms for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per-row transform is taken first. This results in a new image where the first half is a low-frequency sub-band and the second half is the high-frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low-frequency content approximates the image and the high-frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
In left half of the above image we see a grid containing four different transformations of the same initial image. Each transform has been done by applying a row wise convolution with a wavelet of either high or low frequency, then a column wise convolution with another wavelet of either high or low frequency. The four choices of frequency (LL, LH, HL, HH) result in four different images. The top left image is the result of applying a low frequency wavelet convolution to the original image both row wise and column wise. The bottom left image is the result of first applying a high frequency wavelet convolution row wise and then applying a low frequency wavelet convolution column wise. Since the LL (top right) transformation preserves the original image best, it is then used in this process again to generate the grid of smaller images that can be seen in the top center-right of the above image. The images in this smaller grid are called second order coefficients.<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<br />
\begin{align}<br />
\begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix}<br />
\end{align}<br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* For each row i = <math>[i_{1}, i_{2}, i_{3}, i_{4}]</math> of the input image, transform the row to <math>i_{t}</math> via<br />
<br />
\begin{align}<br />
i_{t} = [(i_{1} + i_{2}) / 2, (i_{3} + i_{4}) / 2, (i_{1}, - i_{2}) / 2, (i_{3} - i_{4}) / 2]<br />
\end{align}<br />
<br />
After the row transforms, the images looks as follows:<br />
\begin{align}<br />
\begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix}<br />
\end{align}<br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create less artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k <= 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\psi[j,n]|_{n = 2k, k <= 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math>, <math>W_\psi</math>, are approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are time reversed scaling and wavelet vectors, <math>(n)</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply to images, FWT is first applied on the rows and then the columns. If a low (L) and high(H) sub-band is extracted from the rows and similarly for the columns than at each level there is 4 sub-bands (LH, HL, HH, and LL) where LL will further be decomposed into the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k <= 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SHVN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used and the Haar wavelet is used due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window, and a consistent pooling method was used for all pooling layers of a network. The overall results teach us that the pooling method should be chosen specifically for the type of data we have. In some cases, wavelet pooling may perform the best, and in other cases, other methods may perform better, if the data is more suited for those types of pooling.<br />
<br />
=== MNIST ===<br />
Figure 7 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy from all pooling methods compared. Figure 8 shows the energy of each method per epoch.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:paper21_fig8.png|800px|center]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
In order to investigate the performance of different pooling methods, two types of networks are trained based on CIFAR-10. The first one is the regular CNN and the second one is the network with dropout and batch normalization. Figure 9 shows the network and Tables 2 and 3 shows the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive, while max pooling overfitted on the validation data fairly quickly as shown by the right energy curve in Figure 10 (although the accuracy performance is not significantly worse when dropout and batch normalization are applied).<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:paper21_fig10.png|800px|center]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
===SHVN===<br />
Figure 11 shows the network and Tables 4 and 5 shows the accuracy without and with dropout. The proposed method does not perform well in this experiment. <br />
<br />
[[File: a.png|650px|center|]]<br />
<br />
[[File:paper21_fig12.png|800px|center]]<br />
<br />
[[File: b.png|650px|center|]]<br />
<br />
===KDEF===<br />
The authors experimented with pooling methods + dropout on the KDEF dataset (which consists of 4,900 images of 35 people portraying varying emotions through facial expressions under different poses, 3,900 of which were randomly assigned to be used for training). The data was treated for errors (e.g. corrupt images) and resized to 128x128 for memory and time constraints. <br />
<br />
Figure 13 below shows the network structure. Figure 14 shows the energy curve of the competing models on training and validation sets as the number of epochs increases, and Table 6 shows the accuracy performance. Average pooling demonstrated the highest accuracy, with wavelet pooling coming in second and max-pooling a close third. However, stochastic and wavelet pooling exhibited more stable learning progression compared to the other methods, and max-pooling eventually overfitted. <br />
<br />
[[File:kdef_struc.PNG|700px|center|]]<br />
[[File:kdef_curve.PNG|750px|center|]]<br />
[[File:kdef_accu.PNG|550px|center|]]<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet Pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to a comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally extensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (HAAR wavelet)<br />
<br />
== References ==<br />
* Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.<br />
* J. Anthony Parker, Robert V. Kenyon, and Donald E. Troxel. Comparison of interpolating methods for image resampling. IEEE Transactions on Medical Imaging, 2(1):31–39, 1983.<br />
* Matthew Zeiler and Robert Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proceedings of the International Conference on Learning Representation (ICLR), 2013.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34762Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-20T13:51:30Z<p>Tdunn: /* Probabilistic Framework for Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML by the authors suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge&diff=34301Label-Free Supervision of Neural Networks with Physics and Domain Knowledge2018-03-15T14:54:50Z<p>Tdunn: /* Conclusion and Critique */</p>
<hr />
<div>== Introduction ==<br />
Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (LeCun, Bengio, and Hinton 2015[1]). Nonetheless, Humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.<br />
<br />
[[File:c433li-1.png]]<br />
<br />
Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:<br />
* a reduction in the amount of work spent labeling, <br />
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.<br />
<br />
== Problem Setup ==<br />
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via <br />
<br />
::<math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math><br />
<br />
where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler' functions (Occam's razor).<br />
<br />
In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>. <br />
<br />
Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.<br />
::<math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math><br />
<br />
If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.<br />
<br />
== Experiments ==<br />
=== Tracking an object in free fall ===<br />
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.<br />
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
<br />
The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:<br />
::<math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math><br />
<br />
where<br />
::<math><br />
\mathbf{A} = <br />
\left[ {\begin{array}{*{20}c}<br />
\Delta t & 1 \\<br />
2\Delta t & 1 \\<br />
3\Delta t & 1 \\<br />
\vdots & \vdots \\<br />
N\Delta t & 1 \\<br />
\end{array} } \right]<br />
</math><br />
<br />
The constraint loss is then defined as<br />
::<math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math><br />
<br />
Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).<br />
<br />
[[File:c433li-2.png]]<br />
<br />
The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.<br />
<br />
Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result is pretty satisfying.<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 12.1% || 94.5% || 90.1%<br />
|}<br />
<br />
=== Tracking the position of a walking man ===<br />
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.<br />
<br />
Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.<br />
<br />
::<math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math><br />
which seeks to maximize the standard deviation of the output, and<br />
<br />
::<math>\begin{split}<br />
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\<br />
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))<br />
\end{split}<br />
</math><br />
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:<br />
<br />
::<math><br />
\begin{split}<br />
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\<br />
& \gamma_1 * h_1(\mathbf{x}) <br />
\hphantom{\text{ }}+ \\<br />
& \gamma_2 * h_2(\mathbf{x})<br />
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\<br />
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))<br />
\end{split}<br />
</math><br />
<br />
[[File:c433li-3.png]]<br />
<br />
The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.<br />
<br />
As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 45.9% || 80.5% || 95.4%<br />
|}<br />
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).<br />
<br />
=== Detecting objects with causal relationships ===<br />
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.<br />
<br />
[[File:paper18_Experiment_3.png|400px]]<br />
<br />
Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:<br />
<br />
::<math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math><br />
<br />
which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects, <br />
<br />
::<math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math><br />
<br />
which seeks high variance outputs, and<br />
<br />
::<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\<br />
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math><br />
<br />
which seeks high entropy outputs. The final loss function then becomes: <br />
<br />
::<math> \begin{split}<br />
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\<br />
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) + <br />
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)<br />
\end{split}<br />
</math><br />
<br />
'''Evaluation'''<br />
<br />
The input images, shown in Fig. (4), are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.<br />
<br />
== Conclusion and Critique ==<br />
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]).<br />
::<math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math><br />
<br />
<br />
Also, the paper has a mistake where they quote the free fall equation as<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math><br />
which should be<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
Although in this case it doesn't affect the result.<br />
<br />
<br />
For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two loss.<br />
<br />
These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.<br />
<br />
Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.<br />
<br />
== References ==<br />
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.<br />
<br />
[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge&diff=34300Label-Free Supervision of Neural Networks with Physics and Domain Knowledge2018-03-15T14:54:36Z<p>Tdunn: /* Conclusion and Critique */</p>
<hr />
<div>== Introduction ==<br />
Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (LeCun, Bengio, and Hinton 2015[1]). Nonetheless, Humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.<br />
<br />
[[File:c433li-1.png]]<br />
<br />
Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:<br />
* a reduction in the amount of work spent labeling, <br />
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.<br />
<br />
== Problem Setup ==<br />
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via <br />
<br />
::<math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math><br />
<br />
where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler' functions (Occam's razor).<br />
<br />
In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>. <br />
<br />
Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.<br />
::<math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math><br />
<br />
If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.<br />
<br />
== Experiments ==<br />
=== Tracking an object in free fall ===<br />
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.<br />
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
<br />
The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:<br />
::<math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math><br />
<br />
where<br />
::<math><br />
\mathbf{A} = <br />
\left[ {\begin{array}{*{20}c}<br />
\Delta t & 1 \\<br />
2\Delta t & 1 \\<br />
3\Delta t & 1 \\<br />
\vdots & \vdots \\<br />
N\Delta t & 1 \\<br />
\end{array} } \right]<br />
</math><br />
<br />
The constraint loss is then defined as<br />
::<math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math><br />
<br />
Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).<br />
<br />
[[File:c433li-2.png]]<br />
<br />
The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.<br />
<br />
Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result is pretty satisfying.<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 12.1% || 94.5% || 90.1%<br />
|}<br />
<br />
=== Tracking the position of a walking man ===<br />
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.<br />
<br />
Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.<br />
<br />
::<math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math><br />
which seeks to maximize the standard deviation of the output, and<br />
<br />
::<math>\begin{split}<br />
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\<br />
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))<br />
\end{split}<br />
</math><br />
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:<br />
<br />
::<math><br />
\begin{split}<br />
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\<br />
& \gamma_1 * h_1(\mathbf{x}) <br />
\hphantom{\text{ }}+ \\<br />
& \gamma_2 * h_2(\mathbf{x})<br />
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\<br />
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))<br />
\end{split}<br />
</math><br />
<br />
[[File:c433li-3.png]]<br />
<br />
The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.<br />
<br />
As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 45.9% || 80.5% || 95.4%<br />
|}<br />
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).<br />
<br />
=== Detecting objects with causal relationships ===<br />
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.<br />
<br />
[[File:paper18_Experiment_3.png|400px]]<br />
<br />
Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:<br />
<br />
::<math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math><br />
<br />
which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects, <br />
<br />
::<math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math><br />
<br />
which seeks high variance outputs, and<br />
<br />
::<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\<br />
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math><br />
<br />
which seeks high entropy outputs. The final loss function then becomes: <br />
<br />
::<math> \begin{split}<br />
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\<br />
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) + <br />
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)<br />
\end{split}<br />
</math><br />
<br />
'''Evaluation'''<br />
<br />
The input images, shown in Fig. (4), are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.<br />
<br />
== Conclusion and Critique ==<br />
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]).<br />
::<math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math><br />
<br />
<br />
Also, the paper has a mistake where they quote the free fall equation as<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math><br />
which should be<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
Although in this case it doesn't affect the result.<br />
<br />
<br />
For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two loss.<br />
<br />
These methods essentially boils down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.<br />
<br />
Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.<br />
<br />
== References ==<br />
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.<br />
<br />
[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=34291Wasserstein Auto-Encoders2018-03-15T14:23:04Z<p>Tdunn: /* Theory of Optimal Transport and Wasserstein Distance */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, encoder-decoder structure and nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
When dealing with the continuous probability domain, the EM distance or the minimum one among the costs of all dirt moving solutions becomes:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\sim\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
Where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because it's marginal structure gives some intuition that it represents the amount of probability mass to be moved from x to y. This intuition can be explained by looking at the following equation.<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y) \text{ The mass at y is distributed over x.}<br />
\end{align}<br />
<br />
The Wasserstein distance or the cost of Optimal Transport (OT) provides a much weaker topology, which informally means that it makes it easier for a sequence of distribution to converge as compared to other ''f''-divergences. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. As a result, stronger notions of distances such as KL-divergence, often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show there is a nice relationship between the magnitude of the Wasserstein distance and the distance between distributions; a smaller distance nicely corresponds to a smaller distance between the two distributions, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, i.e. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, i.e. <math>\small X</math>, are used for random variables and lower case letters, i.e. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, i.e. <math>\small P(X)</math>, and corresponding densities with lower case letters, i.e. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize OT <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in \pmb{\mathbb{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in \pmb{\mathbb{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is a set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math>, the following Kantorovich-Rubinstein duality holds for the <math>\small 1^{st}</math> root of <math>\small W_c</math>:<br />
\begin{align}<br />
\small W_1(P_X, P_G) := \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded Lipschitz continuous functions.<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variables <math>\small P_G </math> defined by a two step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q(Z) </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \in P_X </math> and <math>\small Z \in Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_z=P_z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraints on <math>\small Q(Z|X) </math> and <math>\small P_Z </math> are relaxed and a penalty function is added to the objective leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in Q}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small Q </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD).<br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence metric, and use adversarial training to estimate it. Specifically a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> trying to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathcal{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. This can be used as a divergence measure and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized encoders only minimized the reconstruction cost, and resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower-bound term comprising of a reconstruction cost and a KL-divergence measure which captures how distinct each training example is from the prior <math>\small P_Z</math>. This however does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. This is ensured by WAE however, is a direct consequence of our objective function derived from Theorem 1, and is visually represented in the figure below. It is also interesting to note that this also allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial autoencoders (AAE). Thus the theory suggests that AAE minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus it cannot be applied to another other cost <math>\small W_c</math> as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1, is relaxed in line with theory on unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
Gaussian prior distribution <math> \small P_Z</math> and squared cost function <math> \small c(x,y)</math> are used for data-points. The encoder-decoder pairs were deterministic. The convolutional deep neural network for encoder mapping and decoder mapping are similar to DC-GAN with batch normalization. Real world datasets, MNIST with 70k images and CelebA with 203k images were used for training and testing.<br />
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> composed of several fully connected layers with ReLu. For WAE-MMD, the RBF kernel failed to penalize outliers and thus the authors resorted to using inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel{x-y}_2^2\parallel) </math>. Trained models are presented in the figure below.<br />
As far as random sampled results are concerned, WAE-GAN seems to be highly unstable but do lead to better matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD on the other hand has much more stable training and fairly good quality of sampled results.<br />
<br />
'''Qualitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, they use the Fréchet Inception Distance (which measures the similarity between two sets of images, by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations) and report the results on CelebA. These results confirm that the sampled images from WAE are of better quality than from VAE (score: 82), and WAE-GAN gets a slightly better score (score:42) than WAE-MMD (score:55), which correlates with visual inspection of the images.<br />
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|right|Qualitative Assessment of Images]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
2. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
3. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34145Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T01:26:23Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need a only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34144Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T01:23:27Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need a only a limited amount of experience in order to hold their own against a constantly improving opponent. The figure below shows how the different adaptation strategies respond to having access to different amounts of information. Note the meta learning update strategy maintains the same performance regardless of the number of episodes it is allowed to use to perform its adaptation update.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34140Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T01:14:47Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:robosumo_results.png&diff=34139File:robosumo results.png2018-03-15T01:13:52Z<p>Tdunn: </p>
<hr />
<div></div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34138Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T01:13:37Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to evenly matched.<br />
<br />
[[File:robosumo_results.png | 1000px]]<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34136Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T01:01:31Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to evenly matched.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:weight_cal.png&diff=34135File:weight cal.png2018-03-15T00:59:42Z<p>Tdunn: </p>
<hr />
<div></div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34134Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T00:59:33Z<p>Tdunn: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34130Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T00:55:59Z<p>Tdunn: /* Training Spiders to Fight Each Other (Adversarial Meta-Learning) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34129Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T00:54:37Z<p>Tdunn: /* Training Spiders to Fight Each Other (Adversarial Meta-Learning) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here.<br />
<br />
A family of non-adaptive policies are first trained via self play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners.<br />
<br />
The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to %50.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34128Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-15T00:52:24Z<p>Tdunn: /* Training Spiders to Fight Each Other (Adversarial Meta-Learning) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds, but the anatomy of both wrestlers is constant across rounds.<br />
* A Markov Decision Process composed of<br />
** A horizon <math>H = 500n</math>, where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agent's legs.<br />
** Rewards <math> R_t </math> given to the agent based on its wrestling performance, with terminal value <math>R_{500n} = +2000</math> for a won episode, <math>-2000</math> for a loss, and <math>-1000</math> for a draw.<br />
<br />
Note that the above reward setup is quite sparse. Therefore, in order to encourage fast training, shaping rewards are introduced at every time step for the following (a sketch combining these terms appears after the list):<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponent's body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
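<br />
Putting the sparse terminal reward and the dense shaping terms together, a per-time-step reward might look like the sketch below. The terminal values come from the text; the shaping weights and state fields are illustrative assumptions.<br />
<br />
<pre>
def step_reward(state, outcome=None,
                w_center=1.0, w_force=1.0, w_approach=1.0, w_push=1.0):
    """Dense shaping terms plus the sparse terminal reward (if the round ended)."""
    r = 0.0
    r -= w_center * state['dist_self_to_center']     # stay close to the ring's center
    r += w_force * state['force_on_opponent']        # exert force on the opponent's body
    r -= w_approach * state['dist_to_opponent']      # move towards the opponent
    r += w_push * state['dist_opponent_to_center']   # push the opponent from the center
    if outcome is not None:                          # final time step only
        r += {'win': 2000, 'lose': -2000, 'draw': -1000}[outcome]
    return r
</pre>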
<br />
This makes sense intuitively: these are reasonable subgoals for agents to pursue while they are learning to wrestle.<br />
<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here.<br />
<br />
A family of non-adaptive policies is first trained via self-play and saved at all stages. Self-play simply means that the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training the meta-learners.<br />
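<br />
A minimal sketch of this pretraining procedure, assuming a hypothetical policy object with a copy() method and a train_one_epoch routine:<br />
<br />
<pre>
import random

def self_play_pretrain(policy, train_one_epoch, num_epochs):
    """Train via self-play, saving every intermediate policy version."""
    snapshots = []
    for _ in range(num_epochs):
        train_one_epoch(agent=policy, opponent=policy)  # both wrestlers share one policy
        snapshots.append(policy.copy())                 # keep all skill levels
    return snapshots

def sample_opponent(snapshots):
    """Draw an opponent of arbitrary skill when training a meta-learner."""
    return random.choice(snapshots)
</pre>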
<br />
The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S. Sutton, Anna Koop, and David Silver. "On the role of tracking in stationary environments." In Proceedings of the 24th International Conference on Machine Learning, pp. 871–878. ACM, 2007.</div>Tdunn
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34030Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-14T16:56:26Z<p>Tdunn: /* Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4, 6,or 8 legs sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefor in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34029Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-14T16:55:46Z<p>Tdunn: /* Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4, 6,or 8 legs sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefor in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34028Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-14T16:55:32Z<p>Tdunn: /* Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform.<br />
<br />
[[File:locomotion_results.png | 500px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4, 6,or 8 legs sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefor in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:locomotion_results.png&diff=34027File:locomotion results.png2018-03-14T16:54:39Z<p>Tdunn: </p>
<hr />
<div></div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34026Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-14T16:54:28Z<p>Tdunn: /* Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform.<br />
<br />
[[file:locomotion_results.png | 500px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4, 6,or 8 legs sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefor in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunnhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34025Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-14T16:53:20Z<p>Tdunn: /* Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In Meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence in this case the term "Meta-Learning" has the exact meaning you would expect; the word "Meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of Meta-Learning here is to design and train and single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?"<br />
# A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017)<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017)<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007)<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x_0), P_T(x_t | x_{t-1}, a_t)</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_T^{1:k}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation.<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math><br />
<br />
Where <math>\mathcal{L}_{T_i}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of it's body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode <math> i=3 </math> the meta learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta learners significantly out perform.<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4, 6,or 8 legs sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefor in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
This makes sense intuitively as fighters these are reasonable goal for agents who are learning to wrestle to explore.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Tdunn