http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Bapskiko&feedformat=atomstatwiki - User contributions [US]2021-04-13T17:08:50ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=36020Training And Inference with Integers in Deep Neural Networks2018-04-02T21:08:14Z<p>Bapskiko: /* Weight Initialization */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated and have high memory requirements and be able to perform many floating-point operations. As a result, running many of these models are very expensive in terms of energy use, and using state-of-the-art networks in applications where energy is limited can be very difficult. In order to overcome this and allow use of these networks in situations with low energy availability, the energy costs must be reduced while trying to maintain as high network performance as possible and/or practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations [2] add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. XNOR-Net [11] uses bitwise operations to approximate convolutions in a highly memory-efficient manner, and applies a filter-wise scaling factor for weights to improve performance. However, these floating-point factors are calculated simultaneously during training, which aggravates the training effort. Ternary weight networks (TWN) [3] and Trained ternary quantization (TTQ) [9] offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
The DoReFa-Net quantizes gradients to low-bandwidth floating point numbers with discrete states in the backwards pass. In order to reduce the overhead of gradient synchronization in distributed training the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math>, <br />
<br />
where <math>round</math> approximates continuous values to their nearest discrete state, and <math>Clip</math> is the saturation function that clips unbounded values to <math>[-1 + \sigma, 1 - \sigma]</math>. Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original initialization method for <math>\eta</math> is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size seen already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6] with 32C5-MP2-64C5-MP2-512FC-10SSE.<br />
<br />
SVHN & CIFAR10: VGG variant [7] with 2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-10SSE. For CIFAR10 dataset, the data augmentation is followed in Lee et al. (2015) [10] for training.<br />
<br />
ImageNet: AlexNet variant [8] on ILSVRC12 dataset.<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for Vanilla network and Wage network. After being quantized and shifted each layer, the error is reshaped and so most orientation information is retained. ]]<br />
<br />
The table below shows the test error rates on CIFAR10 when left-shift upper boundary with factor γ. From this table we could see that large values play critical roles for backpropagation training even though they don't have many while the majority with small values actually just noises.<br />
<br />
[[File:testerror_rate.png|center]]<br />
<br />
=== Bitwidth of Gradients ===<br />
<br />
The authors next investigated the choice of a proper <math>k_G</math> for gradients using the CIFAR10 dataset. <br />
<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
<br />
The results show similar bitwidth requirements as the last experiment for <math>k_E</math>.<br />
<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added. 7 models are used: 2888 from the first experiment, 288C for more accurate errors (12 bits), 28C8 for larger buffer space, 28f8 for non-quantization of gradients, 28ff for errors and gradients in float32, and 28ff with BN added. The baseline vanilla model refers to the original AlexNet architecture. <br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
<br />
The comparison between 28C8 and 288C shows that the model may perform better if it has more buffer space <math>k_G</math> for gradient accumulation than if it has high-resolution orientation <math>k_E</math>. The authors also noted that batch normalization and <math>k_G</math> are more important for ImageNet because the training set samples are highly variant.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.<br />
# Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeplysupervisednets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.<br />
# Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18&diff=35787stat946w182018-03-28T00:40:27Z<p>Bapskiko: /* Paper presentation */</p>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1fU746Cld_mSqQBCD5qadvkXZW1g-j-kHvmHQ6AMeuqU/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
[https://docs.google.com/forms/d/e/1FAIpQLSdcfYZu5cvpsbzf0Nlxh9TFk8k1m5vUgU1vCLHQNmJog4xSHw/viewform?usp=sf_link Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Feb 27 || || 1|| || || <br />
|-<br />
|Feb 27 || || 2|| || || <br />
|-<br />
|Feb 27 || || 3|| || || <br />
|-<br />
|Mar 1 || Peter Forsyth || 4|| Unsupervised Machine Translation Using Monolingual Corpora Only || [https://arxiv.org/pdf/1711.00043.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]<br />
|-<br />
|Mar 1 || wenqing liu || 5|| Spectral Normalization for Generative Adversarial Networks || [https://openreview.net/pdf?id=B1QRgziT- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network Summary]<br />
|-<br />
|Mar 1 || Ilia Sucholutsky || 6|| One-Shot Imitation Learning || [https://papers.nips.cc/paper/6709-one-shot-imitation-learning.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning Summary]<br />
|-<br />
|Mar 6 || George (Shiyang) Wen || 7|| AmbientGAN: Generative models from lossy measurements || [https://openreview.net/pdf?id=Hy7fDog0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements Summary]<br />
|-<br />
|Mar 6 || Raphael Tang || 8|| Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers || [https://arxiv.org/pdf/1802.00124.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers Summary]<br />
|-<br />
|Mar 6 ||Fan Xia || 9|| Word translation without parallel data ||[https://arxiv.org/pdf/1710.04087.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data Summary]<br />
|-<br />
|Mar 8 || Alex (Xian) Wang || 10 || Self-Normalizing Neural Networks || [http://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks Summary] <br />
|-<br />
|Mar 8 || Michael Broughton || 11|| Convergence of Adam and beyond || [https://openreview.net/pdf?id=ryQu7f-RZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond Summary] <br />
|-<br />
|Mar 8 || Wei Tao Chen || 12|| Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/forum?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data Summary]<br />
|-<br />
|Mar 13 || Chunshang Li || 13 || UNDERSTANDING IMAGE MOTION WITH GROUP REPRESENTATIONS || [https://openreview.net/pdf?id=SJLlmG-AZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations Summary] <br />
|-<br />
|Mar 13 || Saifuddin Hitawala || 14 || Robust Imitation of Diverse Behaviors || [https://papers.nips.cc/paper/7116-robust-imitation-of-diverse-behaviors.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors Summary] <br />
|-<br />
|Mar 13 || Taylor Denouden || 15|| A neural representation of sketch drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings Summary]<br />
|-<br />
|Mar 15 || Zehao Xu || 16|| Synthetic and natural noise both break neural machine translation || [https://openreview.net/pdf?id=BJ8vJebC- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation Summary]<br />
|-<br />
|Mar 15 || Prarthana Bhattacharyya || 17|| Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders Summary] <br />
|-<br />
|Mar 15 || Changjian Li || 18|| Label-Free Supervision of Neural Networks with Physics and Domain Knowledge || [https://arxiv.org/pdf/1609.05566.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge Summary]<br />
|-<br />
|Mar 20 || Travis Dunn || 19|| Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments || [https://openreview.net/pdf?id=Sk2u1g-0- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments Summary]<br />
|-<br />
|Mar 20 || Sushrut Bhalla || 20|| MaskRNN: Instance Level Video Object Segmentation || [https://papers.nips.cc/paper/6636-maskrnn-instance-level-video-object-segmentation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation Summary]<br />
|-<br />
|Mar 20 || Hamid Tahir || 21|| Wavelet Pooling for Convolution Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN Summary]<br />
|-<br />
|Mar 22 || Dongyang Yang|| 22|| Implicit Causal Models for Genome-wide Association Studies || [https://openreview.net/pdf?id=SyELrEeAb Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies Summary]<br />
|-<br />
|Mar 22 || Yao Li || 23||Improving GANs Using Optimal Transport || [https://openreview.net/pdf?id=rkQkBnJAb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT Summary]<br />
|-<br />
|Mar 22 || Sahil Pereira || 24||End-to-End Differentiable Adversarial Imitation Learning|| [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning Summary]<br />
|-<br />
|Mar 27 || Jaspreet Singh Sambee || 25|| Do Deep Neural Networks Suffer from Crowding? || [http://papers.nips.cc/paper/7146-do-deep-neural-networks-suffer-from-crowding.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding Summary]<br />
|-<br />
|Mar 27 || Braden Hurl || 26|| Spherical CNNs || [https://openreview.net/pdf?id=Hkbd5xZRb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs Summary]<br />
|-<br />
|Mar 27 || Marko Ilievski || 27|| Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders || [http://proceedings.mlr.press/v70/engel17a/engel17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders Summary]<br />
|-<br />
|Mar 29 || Alex Pon || 28||PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space || [https://arxiv.org/abs/1706.02413 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space Summary]<br />
|-<br />
|Mar 29 || Sean Walsh || 29||Multi-scale Dense Networks for Resource Efficient Image Classification || [https://arxiv.org/pdf/1703.09844.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification#Architecture Summary]<br />
|-<br />
|Mar 29 || Jason Ku || 30||MarrNet: 3D Shape Reconstruction via 2.5D Sketches || [https://arxiv.org/pdf/1711.03129.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches Summary] <br />
|-<br />
|Apr 3 || Tong Yang || 31|| Dynamic Routing Between Capsules. || [http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf Paper] || <br />
|-<br />
|Apr 3 || Benjamin Skikos || 32|| Training and Inference with Integers in Deep Neural Networks || [https://openreview.net/pdf?id=HJGXzmspb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks Summary] <br />
|-<br />
|Apr 3 || Weishi Chen || 33|| Tensorized LSTMs for Sequence Learning || [https://arxiv.org/pdf/1711.01577.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&action=edit&redlink=1 Summary] <br />
|-</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35786Training And Inference with Integers in Deep Neural Networks2018-03-28T00:39:08Z<p>Bapskiko: Add figures</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or having packaging limitation for hardware, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works on binary weights and activations [2] still use high-precision accumulation for SGD. Ternary weight networks [3] offer more expression than binary weight networks.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
Some methods quantize gradients in the backwards pass, but the weights are still stored in float32, and batch normalization is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function.<br />
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original<math>\eta</math> initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size see already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:p32fig4.PNG&diff=35785File:p32fig4.PNG2018-03-28T00:36:22Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:p32fig3.PNG&diff=35784File:p32fig3.PNG2018-03-28T00:36:15Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:p32fig2.PNG&diff=35783File:p32fig2.PNG2018-03-28T00:36:08Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:p32fig1.PNG&diff=35782File:p32fig1.PNG2018-03-28T00:36:00Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35781Training And Inference with Integers in Deep Neural Networks2018-03-28T00:32:50Z<p>Bapskiko: fix formula. this wiki needs to update. badly.</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or having packaging limitation for hardware, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works on binary weights and activations [2] still use high-precision accumulation for SGD. Ternary weight networks [3] offer more expression than binary weight networks.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
Some methods quantize gradients in the backwards pass, but the weights are still stored in float32, and batch normalization is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:Sandbox, gardens of Schönbrunn.jpg|center|thumb|560x560px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function.<br />
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:Sanxbox.JPG|center|thumb|220x220px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original<math>\eta</math> initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size see already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:Project 1164 Moskva 2009 G1.jpg|center|thumb|800x800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below<br />
[[File:Sandbox.svg|center|thumb|700x700px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35780Training And Inference with Integers in Deep Neural Networks2018-03-28T00:27:13Z<p>Bapskiko: Fix references</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or having packaging limitation for hardware, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works on binary weights and activations [2] still use high-precision accumulation for SGD. Ternary weight networks [3] offer more expression than binary weight networks.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
Some methods quantize gradients in the backwards pass, but the weights are still stored in float32, and batch normalization is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:Sandbox, gardens of Schönbrunn.jpg|center|thumb|560x560px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in \natnums_+</math><br />
<br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function.<br />
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:Sanxbox.JPG|center|thumb|220x220px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original<math>\eta</math> initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size see already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:Project 1164 Moskva 2009 G1.jpg|center|thumb|800x800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below<br />
[[File:Sandbox.svg|center|thumb|700x700px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35779Training And Inference with Integers in Deep Neural Networks2018-03-28T00:11:08Z<p>Bapskiko: Copy from editor</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manners of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or having packaging limitation for hardware, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V <ref>{{Cite journal|last=Sze|first=Vivienne|last2=Chen|first2=Yu-Hsin|last3=Yang|first3=Tien-Ju|last4=Emer|first4=Joel|date=2017-03-27|title=Efficient Processing of Deep Neural Networks: A Tutorial and Survey|url=http://arxiv.org/abs/1703.09039|journal=arXiv:1703.09039 [cs]}}</ref><br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works on binary weights and activations <ref>{{Cite journal|last=Courbariaux|first=Matthieu|last2=Bengio|first2=Yoshua|last3=David|first3=Jean-Pierre|date=2015-11-01|title=BinaryConnect: Training Deep Neural Networks with binary weights during propagations|url=http://arxiv.org/abs/1511.00363|journal=arXiv:1511.00363 [cs]}}</ref> still use high-precision accumulation for SGD. Ternary weight networks<ref>{{Cite journal|last=Li|first=Fengfu|last2=Zhang|first2=Bo|last3=Liu|first3=Bin|date=2016-05-16|title=Ternary Weight Networks|url=http://arxiv.org/abs/1605.04711|journal=arXiv:1605.04711 [cs]}}</ref> offer more expression than binary weight networks.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
Some methods quantize gradients in the backwards pass, but the weights are still stored in float32, and batch normalization is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:Sandbox, gardens of Schönbrunn.jpg|center|thumb|560x560px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in \natnums_+</math><br />
<br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function.<br />
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:Sanxbox.JPG|center|thumb|220x220px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA <ref>{{Cite journal|last=He|first=Kaiming|last2=Zhang|first2=Xiangyu|last3=Ren|first3=Shaoqing|last4=Sun|first4=Jian|date=2015-02-06|title=Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification|url=http://arxiv.org/abs/1502.01852|journal=arXiv:1502.01852 [cs]}}</ref>.<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original<math>\eta</math> initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size see already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net<ref>{{Cite journal|last=Zhou|first=Shuchang|last2=Wu|first2=Yuxin|last3=Ni|first3=Zekun|last4=Zhou|first4=Xinyu|last5=Wen|first5=He|last6=Zou|first6=Yuheng|date=2016-06-20|title=DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients|url=http://arxiv.org/abs/1606.06160|journal=arXiv:1606.06160 [cs]}}</ref>) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant <ref>{{Cite journal|last=Lecun|first=Y.|last2=Bottou|first2=L.|last3=Bengio|first3=Y.|last4=Haffner|first4=P.|date=November 1998|title=Gradient-based learning applied to document recognition|url=http://ieeexplore.ieee.org/document/726791/?reload=true|journal=Proceedings of the IEEE|volume=86|issue=11|pages=2278–2324|doi=10.1109/5.726791|issn=0018-9219}}</ref><br />
<br />
SVHN & CIFAR10: VGG variant <ref>{{Cite journal|last=Simonyan|first=Karen|last2=Zisserman|first2=Andrew|date=2014-09-04|title=Very Deep Convolutional Networks for Large-Scale Image Recognition|url=http://arxiv.org/abs/1409.1556|journal=arXiv:1409.1556 [cs]}}</ref><br />
<br />
ImageNet: AlexNet variant <ref>{{Cite book|url=http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf|title=Advances in Neural Information Processing Systems 25|last=Krizhevsky|first=Alex|last2=Sutskever|first2=Ilya|last3=Hinton|first3=Geoffrey E|date=2012|publisher=Curran Associates, Inc.|editor-last=Pereira|editor-first=F.|pages=1097–1105|editor-last2=Burges|editor-first2=C. J. C.|editor-last3=Bottou|editor-first3=L.|editor-last4=Weinberger|editor-first4=K. Q.}}</ref><br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:Project 1164 Moskva 2009 G1.jpg|center|thumb|800x800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below<br />
[[File:Sandbox.svg|center|thumb|700x700px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation&diff=35477stat946w18/Synthetic and natural noise both break neural machine translation2018-03-25T21:52:30Z<p>Bapskiko: /* Analysis */</p>
<hr />
<div>== Introduction ==<br />
* Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.<br />
<br />
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.<br />
<br />
* A person's ability to read this text comes as no surprise to the Psychology literature<br />
*# Saberi & Perrott (1999) found that this robustness extends to audio as well.<br />
*# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.<br />
*# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.<br />
*# Mayall et al (1997) shows that we rely on word shape.<br />
*# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension<br />
<br />
However, neural machine translation (NMT) systems are brittle. i.e. The Arabic word<br />
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter. <br />
<br />
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.<br />
<br />
Figure 1 shows the performance translating German to English as a function of the percent of German words modified. Here we show two types of noise: (1) Random permutation of the word and (2) Swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.<br />
<br />
[[File:BLEU_plot.PNG]] <br />
<br />
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1.<br />
<br />
This paper explores two simple strategies for increasing model robustness:<br />
# using structure-invariant representations (character CNN representation)<br />
# robust training on noisy data, a form of adversarial training.<br />
<br />
The goal of the paper is two-fold:<br />
# to initiate a conversation on robust training and modeling techniques in NMT<br />
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks<br />
<br />
== Adversarial examples ==<br />
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic<br />
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with<br />
access to the model parameters, and black-box attacks, where examples are generated without such access.<br />
<br />
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.<br />
<br />
== MT system ==<br />
We experiment with three different NMT systems with access to character information at different levels.<br />
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.<br />
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.<br />
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.<br />
<br />
== Data ==<br />
=== MT Data ===<br />
We use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.<br />
<br />
[[File:Table1x.PNG]]<br />
<br />
=== Natural and Artificial Noise ===<br />
==== Natural Noise ====<br />
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:<br />
<br />
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)<br />
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus<br />
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils<br />
<br />
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.<br />
<br />
They insert these errors into the source-side of the parallel data by replacing every word in the corpus with an error if one exists in our dataset. When there is more than one possible replacement to choose, words for which there is no error, are sampled uniformly and kept as is.<br />
<br />
==== Synthetic Noise ====<br />
In addition to naturally collected sources of error, we also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo. <br />
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).<br />
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).<br />
# <code>Fully Random</code> Completely randomized words.<br />
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key<br />
<br />
[[File:Table3x.PNG]]<br />
<br />
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy<br />
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true<br />
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the<br />
translation quality, with random scrambling producing the lowest BLEU scores.<br />
<br />
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well as mentioned in the introduction. The table below shows the translations performed by a German native-speaker human, not familiar with the meme and three machine translation methods. Clearly, the machine translation methods failed. <br />
<br />
[[File:paper16_tab4.png]]<br />
<br />
The author also examined improvements by using a simple spell checker. The author tried correcting error through Google's spell checker by simply accepting the first suggestion on the detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The author concluded using existing spell checkers would not improve the accuracy to be comparable with vanilla text. The results are shown in the table below.<br />
<br />
[[File:paper16_tab5.png]]<br />
<br />
== Dealing with noise ==<br />
=== Structure Invariant Representations ===<br />
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.<br />
<br />
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.<br />
<br />
[[File:Table5x.PNG]]<br />
<br />
<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard errors and Natural errors.<br />
<br />
=== Black-Box Adversarial Training ===<br />
<br />
<code>charCNN</code> Performance<br />
[[File:Table6x.PNG]]<br />
<br />
Here is the result of the translation of the scrambled meme:<br />
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”<br />
<br />
== Analysis ==<br />
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===<br />
<br />
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances. <br />
<br />
As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close uniform weights, hence the consistently lower variance measures.<br />
<br />
[[File:Table7x.PNG]]<br />
<br />
=== Richness of Natural Noise ===<br />
<br />
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the modes trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with error for training.<br />
<br />
== Conclusion ==<br />
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to<br />
address multiple types of errors that are seen in training.<br />
For the future work, the author suggested generating more realistic synthetic noise by using phonetic and syntactic structure. Also, they suggested that a better NMT architecture could be designed which can be robust to noise without seeing it in the training data.<br />
<br />
== Criticism ==<br />
A major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation which has been mostly left untouched. Also these solutions/models still do not tackle the problem of natural noise as the models trained on the synthetic noise don't generalize well to natural noise.<br />
<br />
== References ==<br />
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2017.<br />
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.<br />
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.<br />
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.<br />
# Aurlien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedias Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, may 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.<br />
# Katrin Wisniewski, Karin Schne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.<br />
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.<br />
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=35475Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-03-25T19:27:11Z<p>Bapskiko: /* Future Directions */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. <br />
<br />
= Contributions =<br />
To solve the problem highlighted above the authors propose two main contributions of their paper: <br />
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning<br />
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets<br />
<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. While several smaller music datasets exist, as deep networks train better on abundant, high-quality data, the authors decided on the development of a new dataset - NSynth Dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’<br />
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.<br />
<br />
<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
=== Qualitative Comparison ===<br />
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Future Directions =<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. They claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=35474Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-03-25T19:25:42Z<p>Bapskiko: /* Qualitative Comparison */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. <br />
<br />
= Contributions =<br />
To solve the problem highlighted above the authors propose two main contributions of their paper: <br />
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning<br />
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets<br />
<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. While several smaller music datasets exist, as deep networks train better on abundant, high-quality data, the authors decided on the development of a new dataset - NSynth Dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’<br />
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.<br />
<br />
<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
=== Qualitative Comparison ===<br />
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Future Directions =<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. <br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=35473Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-03-25T19:25:22Z<p>Bapskiko: Undo revision 35472 by Bapskiko (talk)</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. <br />
<br />
= Contributions =<br />
To solve the problem highlighted above the authors propose two main contributions of their paper: <br />
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning<br />
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets<br />
<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. While several smaller music datasets exist, as deep networks train better on abundant, high-quality data, the authors decided on the development of a new dataset - NSynth Dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’<br />
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.<br />
<br />
<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
=== Qualitative Comparison ===<br />
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Future Directions =<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. <br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=35472Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-03-25T19:24:18Z<p>Bapskiko: /* WaveNet Autoencoder */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. <br />
<br />
= Contributions =<br />
To solve the problem highlighted above the authors propose two main contributions of their paper: <br />
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning<br />
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets<br />
<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. <br />
<br />
The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values. Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. While several smaller music datasets exist, as deep networks train better on abundant, high-quality data, the authors decided on the development of a new dataset - NSynth Dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’<br />
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.<br />
<br />
<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
=== Qualitative Comparison ===<br />
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note. <br />
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Future Directions =<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. <br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations&diff=35416Understanding Image Motion with Group Representations2018-03-24T22:09:44Z<p>Bapskiko: /* Criticism */</p>
<hr />
<div>== Introduction ==<br />
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images are used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformation caused by camera motion is a subspace of all position image transformations. Here, we are interested in realistic transformation caused by motion, therefore unrealistic motion caused by say, face swapping, are not considered. <br />
<br />
Supervised learning of 3D motion is challenging since explicit motion labels are no trivial to obtain. The proposed learning method does not need label data. Instead, the method constrains learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captions motion in both 2D and 3D settings.<br />
<br />
[[File:paper13_fig1.png|650px|center|]]<br />
<br />
== Related Work ==<br />
The most common global representations of motion are from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represent poses in special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot be used to represent non-rigid or independent motions. The most used method for local representation is optical flow, which estimates motion by pixel over 2-D image. Furthermore, scene flow is a more generalized method of optical flow which estimates the point trajectories from 3-D motions. The limitation of optical flow is that it only captures motion locally, which makes capturing the overall motion impossible. <br />
<br />
Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions since there is usually a dimensionality reduction process involved. However these approaches are restricted to fixed windows of representation. <br />
<br />
There are also works using CNN’s to learn optical flow using brightness constancy assumptions, and/or photometric local constraints. Works on stereo depth estimation using learning has also shown results. Regarding image sequences, there are works on shuffling the order of images to learn representations of its contents, as well as learning representations equivariant to the egomotion of the camera.<br />
<br />
== Approach ==<br />
The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene. The method is designed to capture group associativity and invertibility.<br />
<br />
Consider a latent structure space <math>S</math>, element of the structure space generates images via projection <math>\pi:S\rightarrow I</math>, latent motion space <math>M</math> which is some closed subgroup of the set of homeomorphism on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates continous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1}))) </math> where the current state is based on the change from the previous. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverse, and contains idenity. <math> SE(3) </math> is an exmaple of this. To be more specific, the latent structure of a scene from rigid image motion could be modelled by a point cloud with a motion space <math>M=SE(3)</math>, where rigid image motion can be produced by a camera translating and rotating through a rigid scene in 3D. When a scene has N rigid bodies, the motion space can be represented as <math>M=[SE(3)]^N</math>.<br />
<br />
=== Learning Motion by Group Properties ===<br />
The goal is to learn a function <math> \Phi : I \times I \rightarrow \overline{M} </math>, <math> \overline{M} </math> indicating mapping of image pairs from <math> M </math> to its representation, as well as the composition operator <math> \diamond : \overline{M} \rightarrow \overline{M} </math> that emulates the composition of these elements in <math> M </math>. For all sequences, it is assumed that for all times <math> t_0 < t_1 < t_2 ... </math>, the sequence representation should have the following properties: <br />
# Associativity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3})</math>, which means that the motion of differently composed subsequences of a sequence are equivalent<br />
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>, where <math>e</math> is the null image motion and the unique identity in the latent space<br />
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>, so the inverse of the motion of an image sequence is the motion of that image sequence reversed<br />
An embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from image sequence. Associativity is encouraged by pushing sequences with the same final motion but different transitions to the same representation. Invertibility is encouraged by pushing sequences corresponding to the same motion with but in opposite directions away from each other, as well as pushing all loops to the same representation. Uniqueness of the identity is encouraged by pushing loops away from non-identity representations. Loops from different sequences are also pushed to the same representation (the identity).<br />
<br />
These constraints are true to any type of transformation resulting from image motion. This puts little restriction on the learning problems and allows all features relevant to the motion structure to be captured. On the other hand, optical flow assumes unchanging brightness between frames of the same projected scene, and motion estimates would degrade when that assumption does not hold.<br />
<br />
Also with this method, it is possible multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, thus the learned representation is not necessary unique. In addition, the scenes are not expected to have rapid changing content, scene cuts, or long-term occlusions.<br />
<br />
=== Sequence Learning with Neural Networks ===<br />
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by CNN and RNN, respectively. LSTM is used for RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images and generates intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce an embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final time step is used for loss. The network is trained to minimize a hinge loss with respect to embeddings to pairs of sequences as defined below:<br />
<br />
<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center><br />
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center><br />
<br />
where <math>d(R^1,R^2)</math> measure the distance between the embeddings of two sequences used for training selected to be cosine distance, <math> m </math> is a fixed scalar margin selected to be 0.5. Positive pairs are training examples where two sequences have the same final motion, negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields similar results as cosine distances.<br />
<br />
Each training sequence is recomposed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scales. Training the network with only one input images per time step was also tried, but consistently yielded work results than image pairs.<br />
<br />
[[File:paper13_fig2.png|650px|center|]]<br />
<br />
Overall, training with image pairs resulted in lower error than training with just single images. This is demonstrated in the below table.<br />
<br />
<br />
[[File:table.png|700px|center|]]<br />
<br />
== Experimentation ==<br />
Trained network using rotated and translated MNIST dataset as well as KITTI dataset. <br />
* Used Torch<br />
* Used Adam for optimization, decay schedule of 30 epochs, learning rate chosen by random serach<br />
* 50-60 batch size for MNIST, 25-30 batch size for KITTI<br />
* Dilated convolution with Relu and batch normalization<br />
* Two LSTM cell per layer 256 hidden units each<br />
* Sequence length of 3-5 images<br />
* MINIST networks with up to 12 images <br />
<br />
=== Rigid Motion in 2D ===<br />
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations<br />
* Visualized the representation using t-SNE<br />
** Clear clustering by translation and rotation but not object classes<br />
** Suggests the representation captures the motion properties in the dataset, but is independent of image contents<br />
* Visualized the image-conditioned saliency maps<br />
**In Figure 3, the red represents the positive gradients of the activation function with respect to the input image, and the negative gradients are represented in blue.<br />
**If we consider a saliency map as a first-order Taylor expansion, then the map could show the relationship between pixel and the representation.<br />
** Take derivative of the network output respect to the map<br />
** The area that has the highest gradient means that part contributes the most to the output<br />
** The resulting salient map strongly resembles spatiotemporal energy filters of classical motion processing<br />
** Suggests the network is learning the right motion structure<br />
<br />
[[File:paper13_fig3.png|700px|center|]]<br />
<br />
=== Real World Motion in 3D ===<br />
* Uses KITTI dataset collected on a car driving through roads in Germany<br />
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth<br />
** The result is compared against self supervised flow algorithm Yu et al.(2016) after the output from the flow algorithm is downsampled, then feed through PCA, then regressed against the camera motion<br />
** The data shows it performs not as well as the supervised algorithm, but consistently better than chance (guessing the mean value)<br />
** Shows the method is able to capture dominant motion structure<br />
* Test performance on interpolation task<br />
** Check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math><br />
** Test how sensitive the network is to deviations from unnatural motion<br />
** High errors <math>\gg 1</math> means the network can distinguish between realistic and unrealistic motion<br />
** In order to do this, the distance between the embeddings of the frame sequences of the first and last frame <math>R([I_1,I_T])</math> and of the first, middle, and last frame <math>R([I_1, I_m, I_T])</math> is computed. This distance is compared with the distance when the middle frame of the second embedding is changed to a frame that is visually similar (inside sequence): <math>R([I_1, I_{IN}, I_T])</math> and one that is visually dissimilar (outside sequence): <math>R([I_1, I_{OUT}, I_T])</math>. The results are shown in Table 3. The embedding distance method is compared to the Euclidean distance which is defined as the mean pixel distance between the test frame and <math>{I_1,I_T}</math>, whichever is smaller. It can be seen from the results that the embedding distance of the true frame is significantly lower than other frames. This means that the embedding distance used in the network is more sensitive to any atypical motions of the scenes. <br />
* Visualized saliency maps<br />
** Highlights objects moving in the background, and motion of the car in the foreground<br />
** Suggests the method can be used for tracking as well<br />
<br />
[[File:paper13_tab2.png|700px|center|]]<br />
<br />
[[File:paper13_fig4.png|700px|center|]]<br />
<br />
[[File:paper13_fig5.png|700px|center|]]<br />
<br />
[[File:table3_motion.PNG|700px|center|]]<br />
<br />
* Figure 7 displays graphs comparing the mean squared error of the method presented in this paper to the baseline chance method and the supervised Flow PCA method.<br />
<br />
[[File:paper13_fig6.PNG|700px|center|]]<br />
<br />
== Conclusion ==<br />
The author presented a new model of motion and a method for learning motion representations. It is shown that by enforcing group properties we can learn motion representations that are able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to a large variety of tasks.<br />
<br />
== Criticism ==<br />
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the author did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The author performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the author, thus the ground truth is readily available. However the author only claimed the validity of the results through indirect means of using t-SNE and saliency map visualization. For the KITTI dataset, the author regressed the representations against ground truth for some mapping from the network output to some physical motion representation. Again, the results were compared only indirectly against ground truth. Such experimentation made the method hardly convincing and applicable to real world applications. In addition, the network does not output motion representations with physical meanings, make the proposed method useless for any real world applications.<br />
<br />
One of the motivations the authors use for this approach is that traditional SLAM formulations represent motion as a sequence of poses in <math> SE(3) </math>, and that they are unable to represent non-rigid or independent motions. There exist SLAM formulations that represent motion as [http://ieeexplore.ieee.org/document/7353368/ Gaussian processes], as well as [http://journals.sagepub.com/doi/abs/10.1177/0278364915585860 temporal basis functions], and it is quite [https://openslam.org/robotvision.html common] for inertial, monocular-camera SLAM problems to use a motion representation on <math> SIM(3) </math>, which is the group containing all scale-preserving transformations. A <math> SIM(3) </math> transformation is not, in general, rigid, so it is not true to say that modern SLAM is unable to represent non-rigid motions. Additionally, the saliency images from the KITTI experiment displaying network gradients on independently moving objects in the scene does not necessarily mean that the motion representation is capturing independent motion, it just means that the network representation is dependent on those pixels. As the authors did not provide an error comparison between images containing independent motions and those without, it is possible that these network gradients only contribute to error (in terms of the camera pose) instead of capturing independent motions.<br />
<br />
Another criticism is that the group-properties constraint the authors impose is too weak. Any set consisting of functions, their inverses, and the identity forms a group. While physical motions are one example of such a group, there are many valid groups that do not represent any coherent physical motions.<br />
<br />
Since the network has to learn both the group elements, <math>\overline{M}</math>, and the composition function, <math>\diamond</math>, associated with the group it is difficult to tell how each of them are performing.<br />
<br />
== References ==<br />
Jaegle, A. (2018). Understanding image motion with group representations . ICLR. Retrieved from https://openreview.net/pdf?id=SJLlmG-AZ.<br />
<br />
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ''International Conference on Learning Representations (ICLR) Workshop'', 2013.<br />
<br />
Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. ''European Conference on Computer Vision (ECCV) Workshops'', 2016.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data&diff=35411stat946w18/Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data2018-03-24T19:30:12Z<p>Bapskiko: /* Vertical Height Estimation */</p>
<hr />
<div>= Introduction =<br />
During emergency 911 calls, knowing the exact position of the victim is crucial to a fast response and a successful rescue. Problems arise when the caller is unable to give their physical position accurately. This can happen for instance when the caller is disoriented, held hostage, or a child is calling on behalf of the victim. GPS sensors on smartphones can provide the rescuers with the geographic location. However GPS fails to give an accurate floor level inside a tall building. Previous works have explored using Wi-Fi signal or beacons placed inside the buildings, but these methods are not self-contained and require prior infrastructure knowledge.<br />
<br />
Fortunately, today’s smartphones are equipped with many more sensors including barometers and magnetometers. Deep learning can be applied to predict floor level based on these sensor readings. <br />
Firstly, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer sensor readings. Next, an unsupervised clustering algorithm is used to predict the floor level depending on the barometric pressure difference. With these two parts working together, a self-contained floor level prediction system can achieve 100% accuracy, without any external prior knowledge.<br />
<br />
= Data Description =<br />
The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. The following sensor readings were recorded: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The indoor-outdoor data has to be manually entered as soon as the user enters or exits a building. To gather the data for floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. The actual floor level was recorded manually for validation purposes only, since unsupervised learning is being used.<br />
<br />
= Methods =<br />
The proposed method first determines if the user is indoors or outdoors and detects the instances of transition between them. When a outdoor to indoor transition event occurs, the elevation of the user is saved using an estimation from the cellphone barometer. Finally, the exact floor level is predicted through clustering techniques. Indoor/outdoor classification is critical to the working of this method. Once the user is detected to be outdoors, he is assumed to be at the ground level. The vertical height and floor estimation is applied only when the user is indoors. The indoor/outdoor transitions are used to save the barometer readings at the ground level for use as reference pressure.<br />
<br />
=== Indoor/Outdoor Classification === <br />
<br />
An LSTM network is used to solve the indoor-outdoor classification problem. Here is a diagram of the network architecture.<br />
<br />
[[File:lstm.jpg | 500px]]<br />
<br />
Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for d consecutive time-steps. Target is y = 1 if indoors and y = 0 if outdoors.<br />
<br />
<math> X_i</math> contains a set of <math>d</math> consecutive sensor readings, i.e. <math> X_i = [x_1, x_2,...,x_d] </math>. <math>Y</math> is labelled as 0 for outdoors and 1 for indoors. <math>d</math> is chosen to be 3 by random-search so that <math>X</math> has 3 points <math>X_i = [x_{j-1}, x_j, x_{j+1}]</math> and the middle <math>x_j</math> is used for the <math>y</math> label.<br />
The LSTM contains three layers. Layers one and two have 50 neurons followed by a dropout layer set to 0.2. Layer 3 has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings, and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true label and the prediction.<br />
<br />
\begin{equation}<br />
C(y_i, \hat{y}_i) = \frac{1}{n} \sum_{i=1}^{n} -(y_i log(\hat{y_i}) + (1 - y_i) log(1 - \hat{y_i}))<br />
\label{equation:binCE}<br />
\end{equation}<br />
<br />
The main reason why the neural network is able to predict whether the user is indoors or outdoors is that it learns a pattern of how the walls of buildings interfere with the GPS signals. The LSTM is able to find the pattern in the GPS signal strength in combination with other sensor readings to give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoor. Thus, a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground floor.<br />
<br />
=== Indoor/Outdoor Transition === <br />
To determine the exact time the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.<br />
<br />
\begin{equation}<br />
V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]<br />
\end{equation} <br />
<br />
\begin{equation}<br />
V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]<br />
\end{equation}<br />
<br />
The Jaccard distances measures the similarity of two sets and is calculated with the following equation:<br />
<br />
\begin{equation}<br />
J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
If the Jaccard distance between <math>V_{1}</math> and sub-sequence <math> s_i </math> is greater or equal to the threshold 0.4, it means there was a transition from indoors to outdoors in the vicinity of the 20 second range of the vector mask. Similarly, a distance of to 0.4 or greater to <math>V_{2}</math> indicates a transition from outdoors to indoors. Sets of transition windows are merged together if they occur close in time to each other, with the average transition time of both windows being used as the new transition time.<br />
<br />
[[File:FindIOIndexes.png | 700px]]<br />
<br />
=== Vertical Height Estimation === <br />
Once the barometric pressure of the ground floor is known, the user’s current relative altitude can be calculated by the international pressure equation, where <math>m_\Delta</math> is the estimated height, <math> p_1 </math> is the pressure reading of the device, and <math> p_0 </math> is the reference pressure at ground level while transitioning from outdoor to indoor.<br />
<br />
\begin{equation}<br />
m_\Delta = f_{floor}(p_0, p_1) = 44330 (1 - (\frac{p_1}{p_0})^{\frac{1}{5.255}})<br />
\label{equation:baroHeight}<br />
\end{equation}<br />
<br />
In appendix B.1, the authors acknowledge that for this system to work, pressures variations due to weather or temperature must be accounted for as those variations are on the same order of magnitude or larger than the pressure variations caused by changing altitude. They suggest using a nearby reference station with known altitude to continuously measure and correct for this effect.<br />
<br />
=== Floor Estimation === <br />
Given the user’s relative altitude, the floor level can be determined. However, this is not a straightforward task because different buildings have different floor heights, different floor labeling (E.g. not including the 13th floor), and floor heights within the same building can vary from floor to floor. To solve these problems, altitude data collected are clustered into groups. Each cluster represents the approximate altitude of a floor.<br />
<br />
Here is an example of altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represent the center of a cluster.<br />
<br />
[[File:clusters.png | 500px]]<br />
<br />
Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is specially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building learned from clustered data-points.<br />
<br />
Here is the algorithm for the floor level prediction.<br />
<br />
[[File:PredictFloor.png | 700px]]<br />
<br />
= Experiments and Results =<br />
The authors performed evaluation on two different tasks: The indoor-outdoor classification task and the floor level prediction task. In the indoor-outdoor detection task, they compared six different models, LSTM, feedforward neural networks, logistic regression, SVM, HMM and Random Forests. In the floor level prediction task, they evaluated the full system.<br />
<br />
== Indoor-Outdoor Classification Results ==<br />
Here are the results for the indoor-outdoor classification problem using different machine learning techniques. LSTM has the best performance on the test set.<br />
The LSTM is trained for 24 epochs with a batch size of 128. All the hyper-parameters such as learning rate(0.006), number of layers, d size, number of hidden units and dropout rate were searched through random search algorithm.<br />
<br />
[[File:IOResults.png]]<br />
<br />
== Floor Level Prediction Results ==<br />
The following are the results for the floor level prediction from the 63 collected samples. Results are given as the percent which matched the floor exactly, off by one, or off by more than one. In each column, the left number is the accuracy using a fixed floor height, and the number on the right is the accuracy when clustering was used to calculate a variable floor height. It was found that using the clustering technique produced 100% accuracy on floor predictions. The conclusion from these results is that using building-specific floor heights produces significantly better results.<br />
<br />
[[File:FloorLevelResults.png]]<br />
<br />
== Floor Level Clustering Results == <br />
Here is the comparison between the estimated floor height and the ground truth in the Uris Hall building.<br />
<br />
[[File:FloorComparison.png]]<br />
<br />
= Criticism =<br />
This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical discoveries. The machine learning techniques used are fairly standard. The neural networks used in this paper only contains 3 layers, and the clustering is applied on one-dimensional data. This leads to the question whether deep learning is necessary and suitable for this task.<br />
<br />
It was explained in the paper that there are many cases where the system does not work. Some cases that were mentioned include: buildings with glass walls, delayed GPS signals, <br />
and pressure changes caused by air conditioning. Other examples I can think of are: uneven floors with some area higher than others, floors rarely visited, and tunnels from one building to another. These special cases are not specifically mentioned in the paper, but they do note that differences between outdoors and pressure-sealed buildings is a problem<br />
<br />
Another weakness of the method comes from the clustering technique. It requires a fair bit of training data. The author suggested two approaches. First, the data can be stored in the individual smartphone. This is not realistic as most people do not visit every single floor of every building, even if it is their own apartment buildings. The second approach is to let a central system (emergency department) collect data from multiple users (which is what the paper’s results are based on). However, such data collection would need to be done in accordance with local laws. Perhaps a better solution would be to use elevation reading to estimate a floor based on typical floor height. Even having a small range of floors of interest could help first responders significantly narrow down their response time.<br />
<br />
Aside from all the technical issues, if knowing the exact floor is required, would it maybe be easier to let the rescuers carry a barometer with them and search for the floor with the transmitted pressure reading?</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data&diff=35410stat946w18/Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data2018-03-24T19:29:46Z<p>Bapskiko: /* Vertical Height Estimation */</p>
<hr />
<div>= Introduction =<br />
During emergency 911 calls, knowing the exact position of the victim is crucial to a fast response and a successful rescue. Problems arise when the caller is unable to give their physical position accurately. This can happen for instance when the caller is disoriented, held hostage, or a child is calling on behalf of the victim. GPS sensors on smartphones can provide the rescuers with the geographic location. However GPS fails to give an accurate floor level inside a tall building. Previous works have explored using Wi-Fi signal or beacons placed inside the buildings, but these methods are not self-contained and require prior infrastructure knowledge.<br />
<br />
Fortunately, today’s smartphones are equipped with many more sensors including barometers and magnetometers. Deep learning can be applied to predict floor level based on these sensor readings. <br />
Firstly, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer sensor readings. Next, an unsupervised clustering algorithm is used to predict the floor level depending on the barometric pressure difference. With these two parts working together, a self-contained floor level prediction system can achieve 100% accuracy, without any external prior knowledge.<br />
<br />
= Data Description =<br />
The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. The following sensor readings were recorded: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The indoor-outdoor data has to be manually entered as soon as the user enters or exits a building. To gather the data for floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. The actual floor level was recorded manually for validation purposes only, since unsupervised learning is being used.<br />
<br />
= Methods =<br />
The proposed method first determines if the user is indoors or outdoors and detects the instances of transition between them. When a outdoor to indoor transition event occurs, the elevation of the user is saved using an estimation from the cellphone barometer. Finally, the exact floor level is predicted through clustering techniques. Indoor/outdoor classification is critical to the working of this method. Once the user is detected to be outdoors, he is assumed to be at the ground level. The vertical height and floor estimation is applied only when the user is indoors. The indoor/outdoor transitions are used to save the barometer readings at the ground level for use as reference pressure.<br />
<br />
=== Indoor/Outdoor Classification === <br />
<br />
An LSTM network is used to solve the indoor-outdoor classification problem. Here is a diagram of the network architecture.<br />
<br />
[[File:lstm.jpg | 500px]]<br />
<br />
Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for d consecutive time-steps. Target is y = 1 if indoors and y = 0 if outdoors.<br />
<br />
<math> X_i</math> contains a set of <math>d</math> consecutive sensor readings, i.e. <math> X_i = [x_1, x_2,...,x_d] </math>. <math>Y</math> is labelled as 0 for outdoors and 1 for indoors. <math>d</math> is chosen to be 3 by random-search so that <math>X</math> has 3 points <math>X_i = [x_{j-1}, x_j, x_{j+1}]</math> and the middle <math>x_j</math> is used for the <math>y</math> label.<br />
The LSTM contains three layers. Layers one and two have 50 neurons followed by a dropout layer set to 0.2. Layer 3 has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings, and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true label and the prediction.<br />
<br />
\begin{equation}<br />
C(y_i, \hat{y}_i) = \frac{1}{n} \sum_{i=1}^{n} -(y_i log(\hat{y_i}) + (1 - y_i) log(1 - \hat{y_i}))<br />
\label{equation:binCE}<br />
\end{equation}<br />
<br />
The main reason why the neural network is able to predict whether the user is indoors or outdoors is that it learns a pattern of how the walls of buildings interfere with the GPS signals. The LSTM is able to find the pattern in the GPS signal strength in combination with other sensor readings to give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoor. Thus, a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground floor.<br />
<br />
=== Indoor/Outdoor Transition === <br />
To determine the exact time the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.<br />
<br />
\begin{equation}<br />
V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]<br />
\end{equation} <br />
<br />
\begin{equation}<br />
V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]<br />
\end{equation}<br />
<br />
The Jaccard distances measures the similarity of two sets and is calculated with the following equation:<br />
<br />
\begin{equation}<br />
J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
If the Jaccard distance between <math>V_{1}</math> and sub-sequence <math> s_i </math> is greater or equal to the threshold 0.4, it means there was a transition from indoors to outdoors in the vicinity of the 20 second range of the vector mask. Similarly, a distance of to 0.4 or greater to <math>V_{2}</math> indicates a transition from outdoors to indoors. Sets of transition windows are merged together if they occur close in time to each other, with the average transition time of both windows being used as the new transition time.<br />
<br />
[[File:FindIOIndexes.png | 700px]]<br />
<br />
=== Vertical Height Estimation === <br />
Once the barometric pressure of the ground floor is known, the user’s current relative altitude can be calculated by the international pressure equation, where <math>m_\Delta</math> is the estimated height, <math> p_1 </math> is the pressure reading of the device, and <math> p_0 </math> is the reference pressure at ground level while transitioning from outdoor to indoor.<br />
<br />
\begin{equation}<br />
m_\Delta = f_{floor}(p_0, p_1) = 44330 (1 - (\frac{p_1}{p_0})^{\frac{1}{5.255}})<br />
\label{equation:baroHeight}<br />
\end{equation}<br />
<br />
In appendix B.1, the authors acknowledge that for this system to work, pressures variations due to weather or temperature must be accounted for as those variations are on the same order of magnitude or larger than the pressure variations caused by changing altitude. They suggest using a nearby reference station with known altitude to continuously measure and correct for these pressure variations.<br />
<br />
=== Floor Estimation === <br />
Given the user’s relative altitude, the floor level can be determined. However, this is not a straightforward task because different buildings have different floor heights, different floor labeling (E.g. not including the 13th floor), and floor heights within the same building can vary from floor to floor. To solve these problems, altitude data collected are clustered into groups. Each cluster represents the approximate altitude of a floor.<br />
<br />
Here is an example of altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represent the center of a cluster.<br />
<br />
[[File:clusters.png | 500px]]<br />
<br />
Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is specially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building learned from clustered data-points.<br />
<br />
Here is the algorithm for the floor level prediction.<br />
<br />
[[File:PredictFloor.png | 700px]]<br />
<br />
= Experiments and Results =<br />
The authors performed evaluation on two different tasks: The indoor-outdoor classification task and the floor level prediction task. In the indoor-outdoor detection task, they compared six different models, LSTM, feedforward neural networks, logistic regression, SVM, HMM and Random Forests. In the floor level prediction task, they evaluated the full system.<br />
<br />
== Indoor-Outdoor Classification Results ==<br />
Here are the results for the indoor-outdoor classification problem using different machine learning techniques. LSTM has the best performance on the test set.<br />
The LSTM is trained for 24 epochs with a batch size of 128. All the hyper-parameters such as learning rate(0.006), number of layers, d size, number of hidden units and dropout rate were searched through random search algorithm.<br />
<br />
[[File:IOResults.png]]<br />
<br />
== Floor Level Prediction Results ==<br />
The following are the results for the floor level prediction from the 63 collected samples. Results are given as the percent which matched the floor exactly, off by one, or off by more than one. In each column, the left number is the accuracy using a fixed floor height, and the number on the right is the accuracy when clustering was used to calculate a variable floor height. It was found that using the clustering technique produced 100% accuracy on floor predictions. The conclusion from these results is that using building-specific floor heights produces significantly better results.<br />
<br />
[[File:FloorLevelResults.png]]<br />
<br />
== Floor Level Clustering Results == <br />
Here is the comparison between the estimated floor height and the ground truth in the Uris Hall building.<br />
<br />
[[File:FloorComparison.png]]<br />
<br />
= Criticism =<br />
This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical discoveries. The machine learning techniques used are fairly standard. The neural networks used in this paper only contains 3 layers, and the clustering is applied on one-dimensional data. This leads to the question whether deep learning is necessary and suitable for this task.<br />
<br />
It was explained in the paper that there are many cases where the system does not work. Some cases that were mentioned include: buildings with glass walls, delayed GPS signals, <br />
and pressure changes caused by air conditioning. Other examples I can think of are: uneven floors with some area higher than others, floors rarely visited, and tunnels from one building to another. These special cases are not specifically mentioned in the paper, but they do note that differences between outdoors and pressure-sealed buildings is a problem<br />
<br />
Another weakness of the method comes from the clustering technique. It requires a fair bit of training data. The author suggested two approaches. First, the data can be stored in the individual smartphone. This is not realistic as most people do not visit every single floor of every building, even if it is their own apartment buildings. The second approach is to let a central system (emergency department) collect data from multiple users (which is what the paper’s results are based on). However, such data collection would need to be done in accordance with local laws. Perhaps a better solution would be to use elevation reading to estimate a floor based on typical floor height. Even having a small range of floors of interest could help first responders significantly narrow down their response time.<br />
<br />
Aside from all the technical issues, if knowing the exact floor is required, would it maybe be easier to let the rescuers carry a barometer with them and search for the floor with the transmitted pressure reading?</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data&diff=35409stat946w18/Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data2018-03-24T19:25:23Z<p>Bapskiko: /* Criticism */</p>
<hr />
<div>= Introduction =<br />
During emergency 911 calls, knowing the exact position of the victim is crucial to a fast response and a successful rescue. Problems arise when the caller is unable to give their physical position accurately. This can happen for instance when the caller is disoriented, held hostage, or a child is calling on behalf of the victim. GPS sensors on smartphones can provide the rescuers with the geographic location. However GPS fails to give an accurate floor level inside a tall building. Previous works have explored using Wi-Fi signal or beacons placed inside the buildings, but these methods are not self-contained and require prior infrastructure knowledge.<br />
<br />
Fortunately, today’s smartphones are equipped with many more sensors including barometers and magnetometers. Deep learning can be applied to predict floor level based on these sensor readings. <br />
Firstly, an LSTM is trained to classify whether the caller is indoors or outdoors using GPS, RSSI (Received Signal Strength Indication), and magnetometer sensor readings. Next, an unsupervised clustering algorithm is used to predict the floor level depending on the barometric pressure difference. With these two parts working together, a self-contained floor level prediction system can achieve 100% accuracy, without any external prior knowledge.<br />
<br />
= Data Description =<br />
The authors developed an iOS app called Sensory and used it to collect data on an iPhone 6. The following sensor readings were recorded: indoors, created at, session id, floor, RSSI strength, GPS latitude, GPS longitude, GPS vertical accuracy, GPS horizontal accuracy, GPS course, GPS speed, barometric relative altitude, barometric pressure, environment context, environment mean building floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The indoor-outdoor data has to be manually entered as soon as the user enters or exits a building. To gather the data for floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. The actual floor level was recorded manually for validation purposes only, since unsupervised learning is being used.<br />
<br />
= Methods =<br />
The proposed method first determines if the user is indoors or outdoors and detects the instances of transition between them. When a outdoor to indoor transition event occurs, the elevation of the user is saved using an estimation from the cellphone barometer. Finally, the exact floor level is predicted through clustering techniques. Indoor/outdoor classification is critical to the working of this method. Once the user is detected to be outdoors, he is assumed to be at the ground level. The vertical height and floor estimation is applied only when the user is indoors. The indoor/outdoor transitions are used to save the barometer readings at the ground level for use as reference pressure.<br />
<br />
=== Indoor/Outdoor Classification === <br />
<br />
An LSTM network is used to solve the indoor-outdoor classification problem. Here is a diagram of the network architecture.<br />
<br />
[[File:lstm.jpg | 500px]]<br />
<br />
Figure 1: LSTM network architecture. A 3-layer LSTM. Inputs are sensor readings for d consecutive time-steps. Target is y = 1 if indoors and y = 0 if outdoors.<br />
<br />
<math> X_i</math> contains a set of <math>d</math> consecutive sensor readings, i.e. <math> X_i = [x_1, x_2,...,x_d] </math>. <math>Y</math> is labelled as 0 for outdoors and 1 for indoors. <math>d</math> is chosen to be 3 by random-search so that <math>X</math> has 3 points <math>X_i = [x_{j-1}, x_j, x_{j+1}]</math> and the middle <math>x_j</math> is used for the <math>y</math> label.<br />
The LSTM contains three layers. Layers one and two have 50 neurons followed by a dropout layer set to 0.2. Layer 3 has two neurons fed directly into a one-neuron feedforward layer with a sigmoid activation function. The input is the sensor readings, and the output is the indoor-outdoor label. The objective function is the cross-entropy between the true label and the prediction.<br />
<br />
\begin{equation}<br />
C(y_i, \hat{y}_i) = \frac{1}{n} \sum_{i=1}^{n} -(y_i log(\hat{y_i}) + (1 - y_i) log(1 - \hat{y_i}))<br />
\label{equation:binCE}<br />
\end{equation}<br />
<br />
The main reason why the neural network is able to predict whether the user is indoors or outdoors is that it learns a pattern of how the walls of buildings interfere with the GPS signals. The LSTM is able to find the pattern in the GPS signal strength in combination with other sensor readings to give an accurate prediction. However, the change in GPS signal does not happen instantaneously as the user walks indoor. Thus, a window of 20 seconds is allowed, and the minimum barometric pressure reading within that window is recorded as the ground floor.<br />
<br />
=== Indoor/Outdoor Transition === <br />
To determine the exact time the user makes an indoor/outdoor transition, two vector masks are convolved across the LSTM predictions.<br />
<br />
\begin{equation}<br />
V_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]<br />
\end{equation} <br />
<br />
\begin{equation}<br />
V_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]<br />
\end{equation}<br />
<br />
The Jaccard distances measures the similarity of two sets and is calculated with the following equation:<br />
<br />
\begin{equation}<br />
J_j = J(s_i, V_j) = \frac{|s_i \cap V_j|}{|s_i| + |V_j| - |s_i \cap V_j|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
If the Jaccard distance between <math>V_{1}</math> and sub-sequence <math> s_i </math> is greater or equal to the threshold 0.4, it means there was a transition from indoors to outdoors in the vicinity of the 20 second range of the vector mask. Similarly, a distance of to 0.4 or greater to <math>V_{2}</math> indicates a transition from outdoors to indoors. Sets of transition windows are merged together if they occur close in time to each other, with the average transition time of both windows being used as the new transition time.<br />
<br />
[[File:FindIOIndexes.png | 700px]]<br />
<br />
=== Vertical Height Estimation === <br />
Once the barometric pressure of the ground floor is known, the user’s current relative altitude can be calculated by the international pressure equation, where <math>m_\Delta</math> is the estimated height, <math> p_1 </math> is the pressure reading of the device, and <math> p_0 </math> is the reference pressure at ground level while transitioning from outdoor to indoor.<br />
<br />
\begin{equation}<br />
m_\Delta = f_{floor}(p_0, p_1) = 44330 (1 - (\frac{p_1}{p_0})^{\frac{1}{5.255}})<br />
\label{equation:baroHeight}<br />
\end{equation}<br />
<br />
=== Floor Estimation === <br />
Given the user’s relative altitude, the floor level can be determined. However, this is not a straightforward task because different buildings have different floor heights, different floor labeling (E.g. not including the 13th floor), and floor heights within the same building can vary from floor to floor. To solve these problems, altitude data collected are clustered into groups. Each cluster represents the approximate altitude of a floor.<br />
<br />
Here is an example of altitude data collected across 41 trials in the Uris Hall building in New York City. Each dashed line represent the center of a cluster.<br />
<br />
[[File:clusters.png | 500px]]<br />
<br />
Figure 2: Distribution of measurements across 41 trials in the Uris Hall building in New York City. A clear size difference is specially noticeable at the lobby. Each dotted line corresponds to an actual floor in the building learned from clustered data-points.<br />
<br />
Here is the algorithm for the floor level prediction.<br />
<br />
[[File:PredictFloor.png | 700px]]<br />
<br />
= Experiments and Results =<br />
The authors performed evaluation on two different tasks: The indoor-outdoor classification task and the floor level prediction task. In the indoor-outdoor detection task, they compared six different models, LSTM, feedforward neural networks, logistic regression, SVM, HMM and Random Forests. In the floor level prediction task, they evaluated the full system.<br />
<br />
== Indoor-Outdoor Classification Results ==<br />
Here are the results for the indoor-outdoor classification problem using different machine learning techniques. LSTM has the best performance on the test set.<br />
The LSTM is trained for 24 epochs with a batch size of 128. All the hyper-parameters such as learning rate(0.006), number of layers, d size, number of hidden units and dropout rate were searched through random search algorithm.<br />
<br />
[[File:IOResults.png]]<br />
<br />
== Floor Level Prediction Results ==<br />
The following are the results for the floor level prediction from the 63 collected samples. Results are given as the percent which matched the floor exactly, off by one, or off by more than one. In each column, the left number is the accuracy using a fixed floor height, and the number on the right is the accuracy when clustering was used to calculate a variable floor height. It was found that using the clustering technique produced 100% accuracy on floor predictions. The conclusion from these results is that using building-specific floor heights produces significantly better results.<br />
<br />
[[File:FloorLevelResults.png]]<br />
<br />
== Floor Level Clustering Results == <br />
Here is the comparison between the estimated floor height and the ground truth in the Uris Hall building.<br />
<br />
[[File:FloorComparison.png]]<br />
<br />
= Criticism =<br />
This paper is an interesting application of deep learning and achieves an outstanding result of 100% accuracy. However, it offers no new theoretical discoveries. The machine learning techniques used are fairly standard. The neural networks used in this paper only contains 3 layers, and the clustering is applied on one-dimensional data. This leads to the question whether deep learning is necessary and suitable for this task.<br />
<br />
It was explained in the paper that there are many cases where the system does not work. Some cases that were mentioned include: buildings with glass walls, delayed GPS signals, <br />
and pressure changes caused by air conditioning. Other examples I can think of are: uneven floors with some area higher than others, floors rarely visited, and tunnels from one building to another. These special cases are not specifically mentioned in the paper, but they do note that differences between outdoors and pressure-sealed buildings is a problem<br />
<br />
Another weakness of the method comes from the clustering technique. It requires a fair bit of training data. The author suggested two approaches. First, the data can be stored in the individual smartphone. This is not realistic as most people do not visit every single floor of every building, even if it is their own apartment buildings. The second approach is to let a central system (emergency department) collect data from multiple users (which is what the paper’s results are based on). However, such data collection would need to be done in accordance with local laws. Perhaps a better solution would be to use elevation reading to estimate a floor based on typical floor height. Even having a small range of floors of interest could help first responders significantly narrow down their response time.<br />
<br />
Aside from all the technical issues, if knowing the exact floor is required, would it maybe be easier to let the rescuers carry a barometer with them and search for the floor with the transmitted pressure reading?</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers&diff=35386stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers2018-03-23T04:03:57Z<p>Bapskiko: /* Results */</p>
<hr />
<div>== Introduction ==<br />
<br />
With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in memory and CPU-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations per second (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. Clearly, it would be difficult to deploy and run these models on low-power devices.<br />
<br />
In general, model compression can be accomplished using four main non-exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-exclusive, we mean that these methods can be used in combination for pruning a single model; the use of one method does not exclude any of the other methods from being viable. <br />
<br />
Ye et al. (2018) explores pruning entire channels in a convolutional neural network. Past work has mostly focused on norm or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is, "mathematically appealing from an optimization perspective and easy to reproduce" (Ye et al., 2018). In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical findings.<br />
<br />
== Motivation ==<br />
<br />
Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:<br />
<br />
$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$<br />
<br />
where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.<br />
<br />
Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.<br />
<br />
<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center><br />
<br />
Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.<br />
<br />
<br />
Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.<br />
<br />
== Method ==<br />
<br />
At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.<br />
<br />
=== Summary ===<br />
<br />
The basic algorithm can be summarized as follows:<br />
<br />
1. Penalize the L1-norm of the batch normalization scaling parameters in the loss<br />
<br />
2. Train until loss plateaus<br />
<br />
3. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
4. Fine-tune the pruned model using regular learning<br />
<br />
=== Details ===<br />
<br />
There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.<br />
<br />
==== Slow Convergence ====<br />
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$<br />
<br />
Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.<br />
<br />
==== Penalty Normalization ====<br />
<br />
In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)<br />
<br />
To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.<br />
<br />
=== Steps ===<br />
<br />
The final algorithm can be summarized as follows:<br />
<br />
1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math><br />
<br />
2. Compute the global LASSO loss with global scaling constant <math>\rho</math><br />
<br />
3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.<br />
<br />
4. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
5. Fine-tune the pruned model using regular learning<br />
<br />
== Results ==<br />
<br />
The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.<br />
<br />
=== CIFAR-10 Experiment ===<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-25.png]]<br />
<br />
For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.<br />
<br />
=== ILSVRC2012 Experiment ===<br />
<br />
The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example how long pruning might take in practice.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-36.png]]<br />
<br />
=== Image Foreground-Background Segmentation Experiment ===<br />
<br />
The authors note that it is common practice to take a network with pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect there might be some extra channels that while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The results are shown in table below<br />
<br />
[[File:paper8_Segmentation.png|700px]]<br />
<br />
== Conclusion ==<br />
<br />
Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.<br />
<br />
It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.<br />
<br />
In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.<br />
<br />
== References ==<br />
<br />
* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).<br />
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.<br />
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.<br />
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.<br />
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.<br />
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).<br />
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf<br />
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.<br />
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=35384stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-03-23T02:47:52Z<p>Bapskiko: /* Robustness To Measurement Model */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often times we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. <br />
<br />
AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10 datasets We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately we do not observe any <math>X^g \sim p_x</math> so we can use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -<br />
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with the respect each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>. <br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results with the convolve + noise case with the celebA dataset, with the AmbientGAN compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created from <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is. <br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.<br />
<br />
The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projection, Pad-Rotate-Project model achieved an inception score of 4.18. Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the score of vanilla GAN 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains relatively stable inception score as the block probability was increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorems 5.2===<br />
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.3===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.4===<br />
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. We show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. We hope that this will allow for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=35383stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-03-23T02:47:41Z<p>Bapskiko: /* Robustness To Measurement Model */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often times we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. <br />
<br />
AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10 datasets We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately we do not observe any <math>X^g \sim p_x</math> so we can use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -<br />
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with the respect each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>. <br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results with the convolve + noise case with the celebA dataset, with the AmbientGAN compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created from <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is. <br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.<br />
<br />
The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projection, Pad-Rotate-Project model achieved an inception score of 4.18. Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the score of vanilla GAN 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains relatively stable inception score as the block probability was increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly a the probability moves away, demonstrating some robustness.<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorems 5.2===<br />
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.3===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.4===<br />
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. We show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. We hope that this will allow for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=35381stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-03-23T02:43:57Z<p>Bapskiko: /* Empirical Results */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often times we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. This will obviously be unable to generate the true distribution before measurement has occurred. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. <br />
<br />
AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10 datasets We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math>, represent the underlying space and <math>y</math> for the measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math> which has distribution <math>p_x^g</math> a measurement <math>Y^g = f_\Theta(G(Z))</math> which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately we do not observe any <math>X^g \sim p_x</math> so we can use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between the <math>Y^g -<br />
f_\Theta(G(Z))</math> and <math>Y^r</math>. That is we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect if a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with the respect each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>. <br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results with the convolve + noise case with the celebA dataset, with the AmbientGAN compared to the baseline results with Wiener deconvolution. It is clear that AmbientGAN has superior performance in this case. The measurement is created from <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise distribution.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images undergone convolve + noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using-navier-stoke inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function 0 pads the images, rotates it by <math>\theta</math>, and projects it on to the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face and is referred to as a failure case in the paper. However the measurement function performs relatively well given how lossy the measurement function is. <br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al.(2016). To evaluate the inception score on a datapoint, a pre-trained inception classification model (Szegedy et al. 2016) is applied to that datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. This KL divergence is the inception score. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines through training several models with different probability <math>p</math> of blocking pixels. The plot on the left shows that the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. By the increase of the blocking probability, AmbientGAN models present a relatively stable performance and perform better than the baseline models. Therefore, AmbientGAN is more robust than all other baseline models.<br />
<br />
The plot on the right reveals the changes in inception scores while the standard deviation of the additive Gaussian noise increased. Baselines perform better when the noise is small. By the increase of the variance, AmbientGAN models present a much better performance compare to the baseline models. Further AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projection, Pad-Rotate-Project model achieved an inception score of 4.18. Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the score of vanilla GAN 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains relatively stable inception score as the block probability was increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove the true underlying distribution of <math>p_x^r</math> can be recovered when we have data that comes from the Gaussian-Projection measurement, Fourier transform measurement and the block-pixels measurement. The do this by showing the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus even when the measurement itself is non-invertible the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results please see appendix A. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorems 5.2===<br />
For the Gussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.3===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math>and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\phi </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\phi </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorems 5.4===<br />
Assume that each image pixel takes values in a finite set P. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from pry , if the discriminator D is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator G must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. We show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. We hope that this will allow for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:robustnessambientgan.png&diff=35380File:robustnessambientgan.png2018-03-23T02:43:19Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=35379Do Deep Neural Networks Suffer from Crowding2018-03-23T02:03:49Z<p>Bapskiko: /* Observations */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.<br />
<br />
= Models =<br />
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types. <br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
===What is the problem in CNNs?===<br />
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object (xax).<br />
<br />
Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.<br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.<br />
[[File:result2.png|750x400px|center]]<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
===DCNN Observations===<br />
* The recognition gets worse with the increase in the number of flankers.<br />
* Convolutional networks are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps in learning invariance.<br />
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier.<br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.<br />
<br />
=Conclusions=<br />
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
<br />
=Critique=<br />
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not seem all that close to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=35378Do Deep Neural Networks Suffer from Crowding2018-03-23T02:03:09Z<p>Bapskiko: /* Critique */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.<br />
<br />
= Models =<br />
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types. <br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
===What is the problem in CNNs?===<br />
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object (xax).<br />
<br />
Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.<br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.<br />
[[File:result2.png|750x400px|center]]<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
===DCNN Observations===<br />
* The recognition gets worse with the increase in the number of flankers.<br />
* Convolutional networks are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps in learning invariance.<br />
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier.<br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.<br />
<br />
=Conclusions=<br />
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
<br />
=Critique=<br />
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not seem all that close to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=35371Do Deep Neural Networks Suffer from Crowding2018-03-23T01:33:22Z<p>Bapskiko: /* Eccentricity-dependent Model */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.<br />
<br />
= Models =<br />
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types. <br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
===What is the problem in CNNs?===<br />
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object (xax).<br />
<br />
Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.<br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.<br />
[[File:result2.png|750x400px|center]]<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
===DCNN Observations===<br />
* The recognition gets worse with the increase in the number of flankers.<br />
* Convolutional networks are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps in learning invariance.<br />
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier.<br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.<br />
<br />
=Conclusions=<br />
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
<br />
=Critique=<br />
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network.<br />
<br />
=References=<br />
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=35313Spherical CNNs2018-03-22T22:31:28Z<p>Bapskiko: /* Implementation with GFFT */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
The implementation of a spherical CNN is challenging mainly because no perfectly symmetrical grids for the sphere exists which makes it difficult to define the rotation of a spherical filter by one pixel and the computational efficiency of the system.<br />
<br />
The main contributions of this paper are the following:<br />
# The theory of spherical CNNs.<br />
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.<br />
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems.<br />
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as a sphere where all of its points are distance of 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha ∈ [0, 2π]</math> and <math>β ∈ [0, π]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>β</math>.<br />
* '''<math>S^2</math> Sphere''' The three dimensional surface from a 3D sphere<br />
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : s^2 → \mathbb{R}^K</math>. K is the number of channels. Such as how RGB images have 3 channels a spherical signal can have numerous channels describing the data. Examples of channels which were used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for speed effective performance of group correlation.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>. The rotation operator simply rotates a function (which allows us to rotate the the spherical filters) by <math>R^{-1}</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
<br />
'''Inner Products''' The inner product of spherical signals is simply the integral summation on the vector space over the entire sphere.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler paramaterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3). This can be thought of as for each rotation combination of <math>\alpha , \beta , \gamma </math> there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healey only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of Spherical CNNs take a function on the sphere (<math>S^2</math>) and output a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to encompass acting on signals from SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where dQ represents the ZYZ-Euler angles <math>d \alpha sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.<br />
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that a continuous periodic function can be expressed as a sum of a series of sine or cosine terms (called Fourier coefficients). The FFT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \geq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
A high-level, approximately-correct, somewhat intuitive explanation of the above figure is that the spherical signal <math> f </math> parameterized over <math> \alpha </math> and <math> \beta </math> having <math> k </math> channels is being correlated with a single filter <math> \psi </math> with the end result being a 3-D feature map on SO(3) (parameterized by Euler angles). The size in <math> \alpha </math> and <math> \beta </math> is the kernel size. The index <math> l </math> going from 0 to 3 correspond the degree of the basis functions used in the Fourier transform. As the degree goes up, so does the dimensionality of vector-valued (for spheres) basis functions. The signals involved are discrete, so the maximum degree (analogous to number of Fourier coefficients) depends on the resolution of the signal. The SO(3) basis functions are matrix-valued, but because <math> S^2 = SO(3)/SO(2) </math>, it ends up that the sphere basis functions correspond to one column in the matrix-valued SO(3) basis functions, which is why the outer product in the figure works.<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they just mentioned that FFT can be computed in <math>O(n\mathrm{log}n)</math> time) or any comparisons on training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors try to show experimentally that their theory of equivariance holds. They express that they had doubts about the equivariance in practice due to potential discretization artifacts since equivariance was proven for the continuous case, with the potential consequence of equivariance not holding being that the weight sharing scheme becomes less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i))/std(\Phi(f_i))</math><br />
Note: The authors do not mention what the std function is however it is likely the standard deviation function as 'std' is the command for standard deviation in MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with filters which have been randomly initialized. The authors mention that they were expecting <math>\Delta</math> to be zero in the case of perfect equivariance. This is due to, as proven earlier, the following two terms equaling each other in the continuous case: <math>\small L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i)</math>. The results are shown in Figure 3. <br />
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> only grows with resolution/layers when there is no activation function. With ReLU activation the error stays constant once slightly higher than 0 resolution. The authors indicate that the error must therefore be from the feature map rotation since this type of error is exact only for bandlimited functions.<br />
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
<br />
This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. This is done as follows. First, for each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. The radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs and introduces a trainable signal representation for spherical signals rotationally equivariant by design. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational invariance for continuous functions, and demonstrates that the invariance also applies to the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications to Spherical CNNs.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section however I found that it is still limited which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond&diff=35271On The Convergence Of ADAM And Beyond2018-03-22T20:05:41Z<p>Bapskiko: /* Notation */</p>
<hr />
<div>= Introduction =<br />
Somewhat different to the presentation I gave in class, this paper focuses strictly on the pitfalls in convergance of the ADAM training algorithm for neural networks from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that it is possible for ADAM to get "stuck" in it's weighted average history, preventing it from converging to an optimal solution. An example is that in an experiment there may be a large spike in the gradient during some minibatches, but since ADAM weighs the current update by the exponential moving averages of squared past<br />
gradients, the effect of the large spike in gradient is lost. This can be prevented through novel adjustments to the ADAM optimization algorithm, which can improve convergence.<br />
<br />
== Notation ==<br />
The paper presents the following framework that generalizes training algorithms to allow us to define a specific variant such as AMSGrad or SGD entirely within it:<br />
<br />
[[File:training_algo_framework.png|700px|center]]<br />
<br />
Where we have <math> x_t </math> as our network parameters defined within a vector space <math> \mathcal{F} </math>. <math> \prod_{\mathcal{F}} (y) = </math> the projection of <math> y </math> on to the set <math> \mathcal{F} </math>.<br />
<math> \psi_t </math> and <math> \phi_t </math> correspond to arbitrary functions we will provide later, The former maps from the history of gradients to <math> \mathbb{R}^d </math> and the latter maps from the history of the gradients to positive semi definite matrices. And finally <math> f_t </math> is our loss function at some time <math> t </math>, the rest should be pretty self explanatory. Using this framework and defining different <math> \psi_t </math> , <math> \phi_t </math> will allow us to recover all different kinds of training algorithms under this one roof.<br />
<br />
=== SGD As An Example ===<br />
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It's easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.<br />
<br />
=== ADAGRAD As Another Example ===<br />
<br />
To recover ADAGRAD, we select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = \frac{\sum_{i=1}^{t} g_i^2}{t} </math>, and <math>\alpha_t = \alpha / \sqrt{t}</math>. Therefore, compared to SGD, ADAGRAD uses a different step size for each parameter, based on the past gradients for that parameter; the learning rate becomes <math> \alpha_t = \alpha / \sqrt{\sum_i g_{i,j}^2} </math> for each parameter <math> j </math>. The authors note that this scheme is quite efficient when the gradients are sparse.<br />
<br />
=== ADAM As Another Example ===<br />
Once you can convince yourself that SGD is correct, you should understand the framework enough to see why the following setup for ADAM will allow us to recover the behaviour we want. ADAM has the ability to define a "learning rate" for every parameter based on how much that parameter moves over time (a.k.a it's momentum) supposedly to help with the learning process.<br />
<br />
In order to do this we will choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=0}^{t} {\beta_1}^{t - i} g_t </math>, psi to be <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=0}^{t} {\beta_2}^{t - i} {g_t}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. This set up is equivalent to choosing a learning rate decay of <math>\alpha / \sqrt{\sum_i g_{i,j}}</math> for <math>j \in [d]</math>.<br />
<br />
From this, we can now see that <math>m_t </math> gets filled up with the exponentially weighted average of the history of our gradients that we've come across so far in the algorithm. And that as we proceed to update we scale each one of our parameters by dividing out <math> V_t </math> (in the case of diagonal it's just 1/the diagonal entry) which contains the exponentially weighted average of each parameters momentum (<math> {g_t}^2 </math>) across our training so far in the algorithm. Thus giving each parameter it's own unique scaling by its second moment or momentum. Intuitively from a physical perspective if each parameter is a ball rolling around in the optimization landscape what we are now doing is instead of having the ball changed positions on the landscape at a fixed velocity (i.e. momentum of 0) the ball now has the ability to accelerate and speed up or slow down if it's on a steep hill or flat trough in the landscape (i.e. a momentum that can change with time).<br />
<br />
= <math> \Gamma_t </math>, an Interesting Quantity =<br />
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:<br />
<br />
<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center><br />
<br />
Which essentially measure the change of the "Inverse of the learning rate" across time (since we are using alpha's as step sizes). Looking back to our example of SGD it's not hard to see that this quantity is strictly positive, which leads to "non-increasing" learning rates which, a desired property. However, that is not the case with ADAM, and can pose a problem in a theoretical and applied setting. The problem ADAM can face is that <math> \Gamma_t </math> can be indefinite, which the original proof assumed it could not be. The math for this proof is VERY long so instead we will opt for an example to showcase why this could be an issue.<br />
<br />
Consider the loss function <math> f_t(x) = \begin{cases} <br />
Cx & \text{for } t \text{ mod 3} = 1 \\<br />
-x & \text{otherwise}<br />
\end{cases} </math><br />
<br />
Where we have <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math><br />
Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math>. We then proceed to plug this into our framework from before. <br />
This function is periodic and it's easy to see that it has the gradient of C once and then a gradient of -1 twice every period. It has an optimal solution of <math> x = -1 </math> (from a regret standpoint), but using ADAM we would eventually converge at <math> x = 1 </math> since <math> C </math> would "overpower" the -1's.<br />
<br />
We formalize this intuition in the results below.<br />
<br />
'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_2 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution.<br />
<br />
= AMSGrad as an improvement to ADAM =<br />
There is a very simple intuitive fix to ADAM to handle this problem. We simply scale our historical weighted average by the maximum we have seen so far to avoid the negative sign problem. There is a very simple one-liner adaptation of ADAM to get to AMSGRAD:<br />
[[File:AMSGrad_algo.png|700px|center]]<br />
<br />
Below are some simple plots comparing ADAM and AMSGrad, the first are from the paper and the second are from another individual who attempted to recreate the experiments. The two plots somewhat disagree with one another so take this heuristic improvement with a grain of salt.<br />
<br />
[[File:AMSGrad_vs_adam.png|900px|center]]<br />
<br />
Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge<br />
<br />
[[File:AMSGrad_vs_adam3.png|900px|center]]<br />
<br />
[[File:AMSGrad_vs_adam2.png|700px|center]]<br />
<br />
= Conclusion =<br />
We have introduced a framework for which we could view several different training algorithms. From there we used it to recover SGD as well as ADAM. In our recovery of ADAM we investigated the change of the inverse of the learning rate over time to discover in certain cases there were convergence issues. We proposed a new heuristic AMSGrad to help deal with this problem and presented some empirical results that show it may have helped ADAM slightly. Thanks for your time.<br />
<br />
= Source =<br />
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors&diff=35024Robust Imitation of Diverse Behaviors2018-03-21T17:46:42Z<p>Bapskiko: /* Policy Optimization Strategy: TRPO */</p>
<hr />
<div>=Introduction=<br />
One of the longest standing challenges in AI is building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, the authors combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.<br />
<br />
=Motivation=<br />
Some of the models that have recently shown great promise in imitation learning for motor control are the deep generative models. The authors primarily talk about two approaches viz. supervised approaches that condition on demonstrations and Generative Adversarial Imitation Learning (GAIL) and their limitations and try to combine those two approaches in order to address these limitations. Some of these limitations are as follows:<br />
<br />
* Supervised approaches that condition on demonstrations (VAE):<br />
** They require large training datasets in order to work for non-trivial tasks<br />
** They tend to be brittle and fail when the agent diverges too much from the demonstration trajectories (As proof of this brittleness, the authors cite Ross et al. (2010), who provide a theorem showing that the cost incurred by this kind of model when it deviates from a demonstration trajectory with a small probability can be amplified in a manner quadratic in the number of time steps. )<br />
* Generative Adversarial Imitation Learning (GAIL)<br />
** Adversarial training leads to mode-collapse (the tendency of adversarial generative models to cover only a subset of modes of a probability distribution, resulting in a failure to produce adequately diverse samples)<br />
** More difficult and slow to train as they do not immediately provide a latent representation of the data<br />
<br />
Thus, the former approach can model diverse behaviors without dropping modes but does not learn robust policies, while the latter approach gives robust policies but insufficiently diverse behaviors. Thus, the authors combine the favorable aspects of these two approaches. The base of their model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. Leveraging these policy representations, they develop a new version of GAIL that <br />
# is much more robust than the purely-supervised controller, especially with few demonstrations, and <br />
# avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.<br />
<br />
=Model=<br />
The authors first introduce a variational autoencoder (VAE) for supervised imitation, consisting of a bi-directional LSTM encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state while modeling correlations among states with a WaveNet.<br />
<br />
[[File: Model_Architecture.png|700px|center|]]<br />
<br />
==Behavioral cloning with VAE suited for control==<br />
<br />
In this section, the authors follow a similar approach to Duan et al. (2017), but opt for stochastic VAEs as having a distribution <math display="inline">q_\phi(z|x_{1:T})</math> to better regularize the latent space. In their VAE, an encoder stochastically maps a demonstration sequence to an embedding vector <math display="inline">z</math>. Given <math display="inline">z</math>, they decode both the state and action trajectories as shown in the figure above. To train the model, the following loss is minimized:<br />
<br />
\begin{align}<br />
L\left( \alpha, w, \phi; \tau_i \right) = - \pmb{\mathbb{E}}_{q_{\phi}(z|x_{1:T_i}^i)} \left[ \sum_{t=1}^{T_i} log \pi_\alpha \left( a_t^i|x_t^i, z \right) + log p_w \left( x_{t+1}^i|x_t^i, z\right) \right] +D_{KL}\left( q_\phi(z|x_{1:T_i}^i)||p(z) \right)<br />
\end{align}<br />
<br />
Where <math> \alpha </math> parameterizes the action decoder, <math> w </math> parameterizes the state decoder, <math> \phi </math> parameterizes the state encoder, and <math> T_i \in \tau_i </math> is the set of demonstration trajectories.<br />
<br />
The encoder <math display="inline">q</math> uses a bi-directional LSTM. To produce the final embedding, it calculates the average of all the outputs of the second layer of this LSTM before applying a final linear transformation to<br />
generate the mean and standard deviation of a Gaussian. Then, one sample from this Gaussian is taken as the demonstration encoding.<br />
<br />
The action decoder is an MLP that maps the concatenation of the state and the embedding of the parameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model. In particular, it conditions on the embedding <math display="inline">z</math> and previous state <math display="inline">x_{t-1}</math> to generate the vector <math display="inline">x_t</math> autoregressively. That is, the autoregression is over the components of the vector <math display="inline">x_t</math>. Finally, instead of a Softmax, the model uses a mixture of Gaussians as the output of the WaveNet.<br />
<br />
==Diverse generative adversarial imitiation learning==<br />
To enable GAIL to produce diverse solutions, the authors condition the discriminator on the embeddings generated by the VAE encoder and integrate out the GAIL objective with respect to the variational posterior <math display="inline">q_\phi(z|x_{1:T})</math>. Specifically, the authors train the discriminator by optimizing the following objective:<br />
<br />
\begin{align}<br />
{max}_{\psi} \pmb{\mathbb{E}}_{\tau_i \sim \pi_E} \left( \pmb{\mathbb{E}}_{q(z|x_{1:T_i}^i)} \left[\frac{1}{T_i} \sum_{t=1}^{T_i} logD_{\psi} \left( x_t^i, a_t^i | z \right) + \pmb{\mathbb{E}}_{\pi_\theta} \left[ log(1 - D_\psi(x, a | z)) \right] \right] \right)<br />
\end{align}<br />
<br />
The authors condition on unlabeled trajectories, which have been passed through a powerful encoder, and hence this approach is capable of one-shot imitation learning. Moreover, the VAE encoder enables to obtain a continuous latent embedding space where interpolation is possible.<br />
<br />
To better motivate the objective, the authors propose on temporarily leaving the context of imitation learning and considering an alternative objective for training GANs<br />
<br />
\begin{align}<br />
{min}_{G}{max}_{D} V (G, D) = \int_{y} p(y) \int_{z} q(z|y) \left[ log D(y | z) + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dy dz<br />
\end{align}<br />
<br />
This function is a simplification of the previous objective function. Furthermore, it satisfies the following property.<br />
<br />
===Lemma 1===<br />
Assuming that <math display="inline">q</math> computes the true posterior distribution that is <math display="inline">q(z|y) = \frac{p(y|z)p(z)}{p(y)}</math> then<br />
<br />
\begin{align}<br />
V (G, D) = \int_{z} p(z) \left[ \int_{y} p(y|z) log D(y|z) dy + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dz<br />
\end{align}<br />
<br />
If an optimal discriminator is further assumed, the cost optimized by the generator then becomes<br />
<br />
\begin{align}<br />
C(G) = 2 \int_ p p(z) JSD[p(\cdot|z) || G(\cdot|z)] dz - log4<br />
\end{align}<br />
<br />
where <math display="inline">JSD</math> stands for the Jensen-Shannon divergence. In the context of the WaveNet described in the earlier section, <math>p(x)</math> is the distribution of a mixture of Gaussians, and <math>p(z)</math> is the distribution over the mixture components, so the conditional distribution over the latent <math>z</math>, <math>p(x | z)</math> is uni-modal, and optimizing the divergence will not lead to mode collapse.<br />
<br />
==Policy Optimization Strategy: TRPO==<br />
<br />
[[file:robust_behaviour_alg.png | 800px]]<br />
<br />
In Algorithm 1, it states that TRPO is used for policy parameter updates. TRPO is short for Trust Region Policy Optimization, which an iterative procedure for policy optimization, developed by John Schulman, Sergey Levine, Philip Moritz, Micheal Jordan and Pieter Abbeel. This optimization methods achieves monotonic improve in fields related to robotic motions, such as walking and swimming. For more details on TRPO, please refer to the [https://arxiv.org/pdf/1502.05477.pdf original paper].<br />
<br />
=Experiments=<br />
<br />
The primary focus of the paper's experimental evaluation is to demonstrate that the architecture allows learning of robust controllers capable of producing the full spectrum of demonstration behaviors for a diverse range of challenging control problems. The authors consider three bodies: a 9 DoF robotic arm, a 9 DoF planar walker, and a 62 DoF complex humanoid (56-actuated joint angles, and a freely translating and rotating 3d root joint). While for the reaching task BC is sufficient to obtain a working controller, for the other two problems the full learning procedure is critical.<br />
<br />
The authors analyze the resulting embedding spaces and demonstrate that they exhibit a rich and sensible structure that can be exploited for control. Finally, the authors show that the encoder can be used to capture the gist of novel demonstration trajectories which can then be reproduced by the controller.<br />
<br />
==Robotic arm reaching==<br />
In this experiment, the authors demonstrate the effectiveness of their VAE architecture and investigate the nature of the learned embedding space on a reaching task with a simulated Jaco arm.<br />
<br />
To obtain demonstrations, the authors trained 60 independent policies to reach to random target locations in the workspace starting from the same initial configuration. 30 trajectories from each of<br />
the first 50 policies were generated. These served as training data for the VAE model (1500 training trajectories in total). The remaining 10 policies were used to generate test data.<br />
<br />
Here are the trajectories produced by the VAE model.<br />
<br />
[[File: Robotic_arm_reaching_VAE.png|300px|center|]]<br />
<br />
The reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust. After training, the VAE encodes and reproduces the demonstrations as shown in the figure below.<br />
<br />
[[File: Robotic_arm_reaching.png|650px|center|]]<br />
<br />
==2D Walker==<br />
<br />
As a more challenging test compared to the reaching task, the authors consider bipedal locomotion. Here, the authors train 60 neural network policies for a 2d walker to serve as demonstrations. These policies are each trained to move at different speeds both forward and backward depending on a label provided as additional input to the policy. Target speeds for training were chosen from a set of four different speeds (m/s): -1, 0, 1, 3.<br />
<br />
[[File: 2D_Walker.png|650px|center|]]<br />
<br />
The authors trained their model with 20 episodes per policy (1200 demonstration trajectories in total, each with a length of 400 steps or 10s of simulated time). In this experiment a full approach is required: training the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general, presumably because the relatively small training set does not cover the space of trajectories sufficiently densely. On this generated dataset, they also train policies with GAIL using the same architecture and hyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherently trajectories. Instead, it simply meshes different behaviors together. In addition, the policies trained with GAIL also, exhibit dramatically less diversity; see [https://www.youtube.com/watch?v=kIguLQ4OwuM video].<br />
<br />
[[File: 2D_Walker_Optimized.gif|frame|center|In the left panel, the planar walker demonstrates a particular walking style. In the right panel, the model's agent imitates this walking style using a single policy network.]]<br />
<br />
==Complex humanoid==<br />
For this experiment, the authors consider a humanoid body of high dimensionality that poses a hard control problem. They generate training trajectories with the existing controllers, which can produce instances of one of six different movement styles. Examples of such trajectories are shown in Fig. 5.<br />
<br />
[[File: Complex_humanoid.png|650px|center|]]<br />
<br />
The training set consists of 250 random trajectories from 6 different neural network controllers that were trained to match 6 different movement styles from the CMU motion capture database. Each trajectory is 334 steps or 10s long. The authors use a second set of 5 controllers from which they generate trajectories for evaluation (3 of these policies were trained on the same movement styles as the policies used for generating training data).<br />
<br />
Surprisingly, despite the complexity of the body, supervised learning is quite effective at producing sensible controllers: The VAE policy is reasonably good at imitating the demonstration trajectories, although it lacks the robustness to be practically useful. Adversarial training dramatically improves the stability of the controller. The authors analyze the improvement quantitatively by computing the percentage of the humanoid falling down before the end of an episode while imitating either training or test policies. The results are summarized in Figure 5 right. The figure further shows sequences of frames of representative demonstration and associated imitation trajectories. Videos of demonstration and imitation behaviors can be found in the [https://www.youtube.com/watch?v=NaohsyUxpxw video].<br />
<br />
[[File: Complex_humanoid_optimized.gif|frame|center|In the left and middle panels we show two demonstrated behaviors. In the right panel, the model's agent produces an unseen transition between those behaviors.]]<br />
<br />
Also, for practical purposes, it is desirable to allow the controller to transition from one behavior to another. The authors test this possibility in an experiment similar to the one for the Jaco arm: They determine the<br />
embedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on the first embedding vector, and then transition from one behavior to the other half-way through the episode by linearly interpolating the embeddings of the two demonstration trajectories over a window of 20 control steps. Although not always successful the learned controller often transitions robustly, despite not having been trained to do so. An example of this can be seen in the gif above. More examples can be seen in this [https://www.youtube.com/watch?v=VBrIll0B24o video].<br />
<br />
=Conclusions=<br />
The authors have proposed an approach for imitation learning that combines the favorable properties of techniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a model that from a moderate number of demonstration tragectories, can learn:<br />
# a semantically well structured embedding of behaviors, <br />
# a corresponding multi-task controller that allows to robustly execute diverse behaviors from this embedding space, as well as <br />
# an encoder that can map new trajectories into the embedding space and hence allows for one-shot imitation. <br />
The experimental results demonstrate that this approach can work on a variety of control problems and that it scales even to very challenging ones such as the control of a simulated humanoid with a large number of degrees of freedoms.<br />
<br />
=Critique=<br />
The paper proposes a deep-learning-based approach to imitation learning which is sample-efficient and is able to imitate many diverse behaviors. The architecture can be seen as conditional generative adversarial imitation learning (GAIL). The conditioning vector is an embedding of a demonstrated trajectory, provided by a variational autoencoder. This results in one-shot imitation learning: at test time, a new demonstration can be embedded and provided as a conditioning vector to the imitation policy. The authors evaluate the method on several simulated motor control tasks.<br />
<br />
Pros:<br />
* Addresses a challenging problem of learning complex dynamics controllers / control policies<br />
* Well-written introduction / motivation<br />
* The proposed approach is able to learn complex and diverse behaviors and outperforms both the VAE alone (quantitatively) and GAIL alone (qualitatively).<br />
* Appealing qualitative results on the three evaluation problems. Interesting experiments with motion transitioning. <br />
<br />
Cons:<br />
* Comparisons to baselines could be more detailed.<br />
* Many key details are omitted (either on purpose, placed in the appendix, or simply absent, like the lack of definitions of terms in the modeling section, details of the planner model, simulation process, or the details of experimental settings)<br />
* Experimental evaluation is largely subjective (videos of robotic arm/biped/3D human motion)<br />
* A discussion of sample efficiency compared to GAIL and VAE would be interesting.<br />
* The presentation is not always clear, in particular, I had a hard time figuring out the notation in Section 3.<br />
* There has been some work on hybrids of VAEs and GANs, which seem worth mentioning when generative models are discussed, like:<br />
# Autoencoding beyond pixels using a learned similarity metric, Larsen et al., ICML 2016<br />
# Generating Images with Perceptual Similarity Metrics based on Deep Networks, Dosovitskiy&Brox. NIPS 2016<br />
These works share the intuition that good coverage of VAEs can be combined with sharp results generated by GANs.<br />
* Some more extensive analysis of the approach would be interesting. How sensitive is it to hyperparameters? How important is it to use VAE, not usual AE or supervised learning? How difficult will it be for others to apply it to new tasks?<br />
<br />
=References=<br />
# Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., & Zaremba, W. (2017). One-shot imitation learning. Preprint arXiv:1703.07326.<br />
# Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.<br />
# Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).<br />
# https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/<br />
# https://www.youtube.com/watch?v=NaohsyUxpxw<br />
# https://www.youtube.com/watch?v=VBrIll0B24o</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:robust_behaviour_alg.png&diff=35023File:robust behaviour alg.png2018-03-21T17:45:48Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors&diff=35020Robust Imitation of Diverse Behaviors2018-03-21T17:07:18Z<p>Bapskiko: /* Behavioral cloning with VAE suited for control */</p>
<hr />
<div>=Introduction=<br />
One of the longest standing challenges in AI is building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, the authors combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.<br />
<br />
=Motivation=<br />
Some of the models that have recently shown great promise in imitation learning for motor control are the deep generative models. The authors primarily talk about two approaches viz. supervised approaches that condition on demonstrations and Generative Adversarial Imitation Learning (GAIL) and their limitations and try to combine those two approaches in order to address these limitations. Some of these limitations are as follows:<br />
<br />
* Supervised approaches that condition on demonstrations (VAE):<br />
** They require large training datasets in order to work for non-trivial tasks<br />
** They tend to be brittle and fail when the agent diverges too much from the demonstration trajectories (As proof of this brittleness, the authors cite Ross et al. (2010), who provide a theorem showing that the cost incurred by this kind of model when it deviates from a demonstration trajectory with a small probability can be amplified in a manner quadratic in the number of time steps. )<br />
* Generative Adversarial Imitation Learning (GAIL)<br />
** Adversarial training leads to mode-collapse (the tendency of adversarial generative models to cover only a subset of modes of a probability distribution, resulting in a failure to produce adequately diverse samples)<br />
** More difficult and slow to train as they do not immediately provide a latent representation of the data<br />
<br />
Thus, the former approach can model diverse behaviors without dropping modes but does not learn robust policies, while the latter approach gives robust policies but insufficiently diverse behaviors. Thus, the authors combine the favorable aspects of these two approaches. The base of their model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. Leveraging these policy representations, they develop a new version of GAIL that <br />
# is much more robust than the purely-supervised controller, especially with few demonstrations, and <br />
# avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.<br />
<br />
=Model=<br />
The authors first introduce a variational autoencoder (VAE) for supervised imitation, consisting of a bi-directional LSTM encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state while modeling correlations among states with a WaveNet.<br />
<br />
[[File: Model_Architecture.png|700px|center|]]<br />
<br />
==Behavioral cloning with VAE suited for control==<br />
<br />
In this section, the authors follow a similar approach to Duan et al. (2017), but opt for stochastic VAEs as having a distribution <math display="inline">q_\phi(z|x_{1:T})</math> to better regularize the latent space. In their VAE, an encoder stochastically maps a demonstration sequence to an embedding vector <math display="inline">z</math>. Given <math display="inline">z</math>, they decode both the state and action trajectories as shown in the figure above. To train the model, the following loss is minimized:<br />
<br />
\begin{align}<br />
L\left( \alpha, w, \phi; \tau_i \right) = - \pmb{\mathbb{E}}_{q_{\phi}(z|x_{1:T_i}^i)} \left[ \sum_{t=1}^{T_i} log \pi_\alpha \left( a_t^i|x_t^i, z \right) + log p_w \left( x_{t+1}^i|x_t^i, z\right) \right] +D_{KL}\left( q_\phi(z|x_{1:T_i}^i)||p(z) \right)<br />
\end{align}<br />
<br />
Where <math> \alpha </math> parameterizes the action decoder, <math> w </math> parameterizes the state decoder, <math> \phi </math> parameterizes the state encoder, and <math> T_i \in \tau_i </math> is the set of demonstration trajectories.<br />
<br />
The encoder <math display="inline">q</math> uses a bi-directional LSTM. To produce the final embedding, it calculates the average of all the outputs of the second layer of this LSTM before applying a final linear transformation to<br />
generate the mean and standard deviation of a Gaussian. Then, one sample from this Gaussian is taken as the demonstration encoding.<br />
<br />
The action decoder is an MLP that maps the concatenation of the state and the embedding of the parameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model. In particular, it conditions on the embedding <math display="inline">z</math> and previous state <math display="inline">x_{t-1}</math> to generate the vector <math display="inline">x_t</math> autoregressively. That is, the autoregression is over the components of the vector <math display="inline">x_t</math>. Finally, instead of a Softmax, the model uses a mixture of Gaussians as the output of the WaveNet.<br />
<br />
==Diverse generative adversarial imitiation learning==<br />
To enable GAIL to produce diverse solutions, the authors condition the discriminator on the embeddings generated by the VAE encoder and integrate out the GAIL objective with respect to the variational posterior <math display="inline">q_\phi(z|x_{1:T})</math>. Specifically, the authors train the discriminator by optimizing the following objective:<br />
<br />
\begin{align}<br />
{max}_{\psi} \pmb{\mathbb{E}}_{\tau_i \sim \pi_E} \left( \pmb{\mathbb{E}}_{q(z|x_{1:T_i}^i)} \left[\frac{1}{T_i} \sum_{t=1}^{T_i} logD_{\psi} \left( x_t^i, a_t^i | z \right) + \pmb{\mathbb{E}}_{\pi_\theta} \left[ log(1 - D_\psi(x, a | z)) \right] \right] \right)<br />
\end{align}<br />
<br />
The authors condition on unlabeled trajectories, which have been passed through a powerful encoder, and hence this approach is capable of one-shot imitation learning. Moreover, the VAE encoder enables to obtain a continuous latent embedding space where interpolation is possible.<br />
<br />
To better motivate the objective, the authors propose on temporarily leaving the context of imitation learning and considering an alternative objective for training GANs<br />
<br />
\begin{align}<br />
{min}_{G}{max}_{D} V (G, D) = \int_{y} p(y) \int_{z} q(z|y) \left[ log D(y | z) + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dy dz<br />
\end{align}<br />
<br />
This function is a simplification of the previous objective function. Furthermore, it satisfies the following property.<br />
<br />
===Lemma 1===<br />
Assuming that <math display="inline">q</math> computes the true posterior distribution that is <math display="inline">q(z|y) = \frac{p(y|z)p(z)}{p(y)}</math> then<br />
<br />
\begin{align}<br />
V (G, D) = \int_{z} p(z) \left[ \int_{y} p(y|z) log D(y|z) dy + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dz<br />
\end{align}<br />
<br />
If an optimal discriminator is further assumed, the cost optimized by the generator then becomes<br />
<br />
\begin{align}<br />
C(G) = 2 \int_ p p(z) JSD[p(\cdot|z) || G(\cdot|z)] dz - log4<br />
\end{align}<br />
<br />
where <math display="inline">JSD</math> stands for the Jensen-Shannon divergence. In the context of the WaveNet described in the earlier section, <math>p(x)</math> is the distribution of a mixture of Gaussians, and <math>p(z)</math> is the distribution over the mixture components, so the conditional distribution over the latent <math>z</math>, <math>p(x | z)</math> is uni-modal, and optimizing the divergence will not lead to mode collapse.<br />
<br />
==Policy Optimization Strategy: TRPO==<br />
<br />
In Algorithm 1, it states that TRPO is used for policy parameter updates. TRPO is short for Trust Region Policy Optimization, which an iterative procedure for policy optimization, developed by John Schulman, Sergey Levine, Philip Moritz, Micheal Jordan and Pieter Abbeel. This optimization methods achieves monotonic improve in fields related to robotic motions, such as walking and swimming. For more details on TRPO, please refer to the [https://arxiv.org/pdf/1502.05477.pdf original paper].<br />
<br />
=Experiments=<br />
<br />
The primary focus of the paper's experimental evaluation is to demonstrate that the architecture allows learning of robust controllers capable of producing the full spectrum of demonstration behaviors for a diverse range of challenging control problems. The authors consider three bodies: a 9 DoF robotic arm, a 9 DoF planar walker, and a 62 DoF complex humanoid (56-actuated joint angles, and a freely translating and rotating 3d root joint). While for the reaching task BC is sufficient to obtain a working controller, for the other two problems the full learning procedure is critical.<br />
<br />
The authors analyze the resulting embedding spaces and demonstrate that they exhibit a rich and sensible structure that can be exploited for control. Finally, the authors show that the encoder can be used to capture the gist of novel demonstration trajectories which can then be reproduced by the controller.<br />
<br />
==Robotic arm reaching==<br />
In this experiment, the authors demonstrate the effectiveness of their VAE architecture and investigate the nature of the learned embedding space on a reaching task with a simulated Jaco arm.<br />
<br />
To obtain demonstrations, the authors trained 60 independent policies to reach to random target locations in the workspace starting from the same initial configuration. 30 trajectories from each of<br />
the first 50 policies were generated. These served as training data for the VAE model (1500 training trajectories in total). The remaining 10 policies were used to generate test data.<br />
<br />
Here are the trajectories produced by the VAE model.<br />
<br />
[[File: Robotic_arm_reaching_VAE.png|300px|center|]]<br />
<br />
The reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust. After training, the VAE encodes and reproduces the demonstrations as shown in the figure below.<br />
<br />
[[File: Robotic_arm_reaching.png|650px|center|]]<br />
<br />
==2D Walker==<br />
<br />
As a more challenging test compared to the reaching task, the authors consider bipedal locomotion. Here, the authors train 60 neural network policies for a 2d walker to serve as demonstrations. These policies are each trained to move at different speeds both forward and backward depending on a label provided as additional input to the policy. Target speeds for training were chosen from a set of four different speeds (m/s): -1, 0, 1, 3.<br />
<br />
[[File: 2D_Walker.png|650px|center|]]<br />
<br />
The authors trained their model with 20 episodes per policy (1200 demonstration trajectories in total, each with a length of 400 steps or 10s of simulated time). In this experiment a full approach is required: training the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general, presumably because the relatively small training set does not cover the space of trajectories sufficiently densely. On this generated dataset, they also train policies with GAIL using the same architecture and hyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherently trajectories. Instead, it simply meshes different behaviors together. In addition, the policies trained with GAIL also, exhibit dramatically less diversity; see [https://www.youtube.com/watch?v=kIguLQ4OwuM video].<br />
<br />
[[File: 2D_Walker_Optimized.gif|frame|center|In the left panel, the planar walker demonstrates a particular walking style. In the right panel, the model's agent imitates this walking style using a single policy network.]]<br />
<br />
==Complex humanoid==<br />
For this experiment, the authors consider a humanoid body of high dimensionality that poses a hard control problem. They generate training trajectories with the existing controllers, which can produce instances of one of six different movement styles. Examples of such trajectories are shown in Fig. 5.<br />
<br />
[[File: Complex_humanoid.png|650px|center|]]<br />
<br />
The training set consists of 250 random trajectories from 6 different neural network controllers that were trained to match 6 different movement styles from the CMU motion capture database. Each trajectory is 334 steps or 10s long. The authors use a second set of 5 controllers from which they generate trajectories for evaluation (3 of these policies were trained on the same movement styles as the policies used for generating training data).<br />
<br />
Surprisingly, despite the complexity of the body, supervised learning is quite effective at producing sensible controllers: The VAE policy is reasonably good at imitating the demonstration trajectories, although it lacks the robustness to be practically useful. Adversarial training dramatically improves the stability of the controller. The authors analyze the improvement quantitatively by computing the percentage of the humanoid falling down before the end of an episode while imitating either training or test policies. The results are summarized in Figure 5 right. The figure further shows sequences of frames of representative demonstration and associated imitation trajectories. Videos of demonstration and imitation behaviors can be found in the [https://www.youtube.com/watch?v=NaohsyUxpxw video].<br />
<br />
[[File: Complex_humanoid_optimized.gif|frame|center|In the left and middle panels we show two demonstrated behaviors. In the right panel, the model's agent produces an unseen transition between those behaviors.]]<br />
<br />
Also, for practical purposes, it is desirable to allow the controller to transition from one behavior to another. The authors test this possibility in an experiment similar to the one for the Jaco arm: They determine the<br />
embedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on the first embedding vector, and then transition from one behavior to the other half-way through the episode by linearly interpolating the embeddings of the two demonstration trajectories over a window of 20 control steps. Although not always successful the learned controller often transitions robustly, despite not having been trained to do so. An example of this can be seen in the gif above. More examples can be seen in this [https://www.youtube.com/watch?v=VBrIll0B24o video].<br />
<br />
=Conclusions=<br />
The authors have proposed an approach for imitation learning that combines the favorable properties of techniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a model that from a moderate number of demonstration tragectories, can learn:<br />
# a semantically well structured embedding of behaviors, <br />
# a corresponding multi-task controller that allows to robustly execute diverse behaviors from this embedding space, as well as <br />
# an encoder that can map new trajectories into the embedding space and hence allows for one-shot imitation. <br />
The experimental results demonstrate that this approach can work on a variety of control problems and that it scales even to very challenging ones such as the control of a simulated humanoid with a large number of degrees of freedoms.<br />
<br />
=Critique=<br />
The paper proposes a deep-learning-based approach to imitation learning which is sample-efficient and is able to imitate many diverse behaviors. The architecture can be seen as conditional generative adversarial imitation learning (GAIL). The conditioning vector is an embedding of a demonstrated trajectory, provided by a variational autoencoder. This results in one-shot imitation learning: at test time, a new demonstration can be embedded and provided as a conditioning vector to the imitation policy. The authors evaluate the method on several simulated motor control tasks.<br />
<br />
Pros:<br />
* Addresses a challenging problem of learning complex dynamics controllers / control policies<br />
* Well-written introduction / motivation<br />
* The proposed approach is able to learn complex and diverse behaviors and outperforms both the VAE alone (quantitatively) and GAIL alone (qualitatively).<br />
* Appealing qualitative results on the three evaluation problems. Interesting experiments with motion transitioning. <br />
<br />
Cons:<br />
* Comparisons to baselines could be more detailed.<br />
* Many key details are omitted (either on purpose, placed in the appendix, or simply absent, like the lack of definitions of terms in the modeling section, details of the planner model, simulation process, or the details of experimental settings)<br />
* Experimental evaluation is largely subjective (videos of robotic arm/biped/3D human motion)<br />
* A discussion of sample efficiency compared to GAIL and VAE would be interesting.<br />
* The presentation is not always clear, in particular, I had a hard time figuring out the notation in Section 3.<br />
* There has been some work on hybrids of VAEs and GANs, which seem worth mentioning when generative models are discussed, like:<br />
# Autoencoding beyond pixels using a learned similarity metric, Larsen et al., ICML 2016<br />
# Generating Images with Perceptual Similarity Metrics based on Deep Networks, Dosovitskiy&Brox. NIPS 2016<br />
These works share the intuition that good coverage of VAEs can be combined with sharp results generated by GANs.<br />
* Some more extensive analysis of the approach would be interesting. How sensitive is it to hyperparameters? How important is it to use VAE, not usual AE or supervised learning? How difficult will it be for others to apply it to new tasks?<br />
<br />
=References=<br />
# Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., & Zaremba, W. (2017). One-shot imitation learning. Preprint arXiv:1703.07326.<br />
# Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.<br />
# Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).<br />
# https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/<br />
# https://www.youtube.com/watch?v=NaohsyUxpxw<br />
# https://www.youtube.com/watch?v=VBrIll0B24o</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34951Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-21T02:08:42Z<p>Bapskiko: /* Evaluating Adaptation Strategies */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill.<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
The figure above displays the results of the experiment. Recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34949Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-21T02:06:26Z<p>Bapskiko: /* Evaluating Adaptation Strategies */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
In order to compare the proposed strategy against existing strategy, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill.<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
The figure above displays the results of the experiment. Recurrent policies are dominant, and the proposed meta-update perform better or at par.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:natural_selection.png&diff=34948File:natural selection.png2018-03-21T02:06:08Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:trueskill_results.PNG&diff=34943File:trueskill results.PNG2018-03-21T02:00:49Z<p>Bapskiko: </p>
<hr />
<div></div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=34942Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-03-21T02:00:24Z<p>Bapskiko: </p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The papers goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 time steps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. Rounds are batches of episodes. An episode results in a draw when neither of these things happen after 500 time steps. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw.<br />
<br />
Note that the above reward set up is quite sparse, therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
In order to compare the proposed strategy against existing strategy, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill.<br />
<br />
[[File:trueskill_results.png | 800px]]<br />
<br />
The figure above displays the results of the experiment. Recurrent policies are dominant, and the proposed meta-update perform better or at par.<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=34671stat946w18/Self Normalizing Neural Networks2018-03-19T22:43:05Z<p>Bapskiko: /* Critique */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such property, and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computations. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures, came close to the state-of-the-art results on the Tox21 dataset with an 8-layer network, and produced a new state-of-the-art AUC on predicting pulsars for the HTRU2 dataset by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
Here are the results on the Tox21 challenge. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=34667Wasserstein Auto-Encoders2018-03-19T21:13:38Z<p>Bapskiko: /* Wasserstein Auto-Encoders */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, encoder-decoder structure and nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
When dealing with the continuous probability domain, the EM distance or the minimum one among the costs of all dirt moving solutions becomes:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\sim\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
Where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because it's marginal structure gives some intuition that it represents the amount of probability mass to be moved from x to y. This intuition can be explained by looking at the following equation.<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
<br />
The Wasserstein distance or the cost of Optimal Transport (OT) provides a much weaker topology, which informally means that it makes it easier for a sequence of distribution to converge as compared to other ''f''-divergences. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. As a result, stronger notions of distances such as KL-divergence, often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show there is a nice relationship between the magnitude of the Wasserstein distance and the distance between distributions; a smaller distance nicely corresponds to a smaller distance between the two distributions, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, i.e. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, i.e. <math>\small X</math>, are used for random variables and lower case letters, i.e. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, i.e. <math>\small P(X)</math>, and corresponding densities with lower case letters, i.e. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize OT <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in \pmb{\mathbb{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in \pmb{\mathbb{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is a set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math>, the following Kantorovich-Rubinstein duality holds for the <math>\small 1^{st}</math> root of <math>\small W_c</math>:<br />
\begin{align}<br />
\small W_1(P_X, P_G) := \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions]. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variables <math>\small P_G </math> defined by a two step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q(Z) </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \in P_X </math> and <math>\small Z \in Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_z=P_z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraints on <math>\small Q(Z|X) </math> and <math>\small P_Z </math> are relaxed and a penalty function is added to the objective leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in Q}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small Q </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD). The authors note that a numerical solution to the dual formulation of the problem has been tried by clipping the weights of the network (to satisfy the Lipschitz condition) and by penalizing the objective with <math>\small \lambda \mathbb{E}(\parallel \nabla f(X) \parallel - 1)^2 </math><br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence metric, and use adversarial training to estimate it. Specifically a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> trying to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathcal{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. This can be used as a divergence measure and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized encoders only minimized the reconstruction cost, and resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower-bound term comprising of a reconstruction cost and a KL-divergence measure which captures how distinct each training example is from the prior <math>\small P_Z</math>. This however does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. This is ensured by WAE however, is a direct consequence of our objective function derived from Theorem 1, and is visually represented in the figure below. It is also interesting to note that this also allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial autoencoders (AAE). Thus the theory suggests that AAE minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus it cannot be applied to another other cost <math>\small W_c</math> as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1, is relaxed in line with theory on unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
Gaussian prior distribution <math> \small P_Z</math> and squared cost function <math> \small c(x,y)</math> are used for data-points. The encoder-decoder pairs were deterministic. The convolutional deep neural network for encoder mapping and decoder mapping are similar to DC-GAN with batch normalization. Real world datasets, MNIST with 70k images and CelebA with 203k images were used for training and testing.<br />
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> composed of several fully connected layers with ReLu activations. For WAE-MMD, the RBF kernel failed to penalize outliers and thus the authors resorted to using inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel{x-y}_2^2\parallel) </math>. Trained models are presented in the figure below.<br />
As far as random sampled results are concerned, WAE-GAN seems to be highly unstable but do lead to better matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD on the other hand has much more stable training and fairly good quality of sampled results.<br />
<br />
'''Qualitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, they use the Fréchet Inception Distance and report the results on CelebA (The Fréchet Inception Distance measures the similarity between two sets of images, by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples. Let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the Fréchet Inception Distance between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017). ) These results confirm that the sampled images from WAE are of better quality than from VAE (score: 82), and WAE-GAN gets a slightly better score (score:42) than WAE-MMD (score:55), which correlates with visual inspection of the images.<br />
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|right|Qualitative Assessment of Images]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=34666Wasserstein Auto-Encoders2018-03-19T20:53:25Z<p>Bapskiko: /* Problem Formulation and Notation */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, encoder-decoder structure and nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
When dealing with the continuous probability domain, the EM distance or the minimum one among the costs of all dirt moving solutions becomes:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\sim\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
Where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because it's marginal structure gives some intuition that it represents the amount of probability mass to be moved from x to y. This intuition can be explained by looking at the following equation.<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
<br />
The Wasserstein distance or the cost of Optimal Transport (OT) provides a much weaker topology, which informally means that it makes it easier for a sequence of distribution to converge as compared to other ''f''-divergences. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. As a result, stronger notions of distances such as KL-divergence, often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show there is a nice relationship between the magnitude of the Wasserstein distance and the distance between distributions; a smaller distance nicely corresponds to a smaller distance between the two distributions, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, i.e. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, i.e. <math>\small X</math>, are used for random variables and lower case letters, i.e. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, i.e. <math>\small P(X)</math>, and corresponding densities with lower case letters, i.e. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize OT <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in \pmb{\mathbb{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in \pmb{\mathbb{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is a set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math>, the following Kantorovich-Rubinstein duality holds for the <math>\small 1^{st}</math> root of <math>\small W_c</math>:<br />
\begin{align}<br />
\small W_1(P_X, P_G) := \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions]. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variables <math>\small P_G </math> defined by a two step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q(Z) </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \in P_X </math> and <math>\small Z \in Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\sim {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_z=P_z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraints on <math>\small Q(Z|X) </math> and <math>\small P_Z </math> are relaxed and a penalty function is added to the objective leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in Q}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small Q </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD).<br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence metric, and use adversarial training to estimate it. Specifically a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> trying to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathcal{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. This can be used as a divergence measure and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized encoders only minimized the reconstruction cost, and resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower-bound term comprising of a reconstruction cost and a KL-divergence measure which captures how distinct each training example is from the prior <math>\small P_Z</math>. This however does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. This is ensured by WAE however, is a direct consequence of our objective function derived from Theorem 1, and is visually represented in the figure below. It is also interesting to note that this also allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial autoencoders (AAE). Thus the theory suggests that AAE minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus it cannot be applied to another other cost <math>\small W_c</math> as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1, is relaxed in line with theory on unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
Gaussian prior distribution <math> \small P_Z</math> and squared cost function <math> \small c(x,y)</math> are used for data-points. The encoder-decoder pairs were deterministic. The convolutional deep neural network for encoder mapping and decoder mapping are similar to DC-GAN with batch normalization. Real world datasets, MNIST with 70k images and CelebA with 203k images were used for training and testing.<br />
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> composed of several fully connected layers with ReLu activations. For WAE-MMD, the RBF kernel failed to penalize outliers and thus the authors resorted to using inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel{x-y}_2^2\parallel) </math>. Trained models are presented in the figure below.<br />
As far as random sampled results are concerned, WAE-GAN seems to be highly unstable but do lead to better matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD on the other hand has much more stable training and fairly good quality of sampled results.<br />
<br />
'''Qualitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, they use the Fréchet Inception Distance and report the results on CelebA (The Fréchet Inception Distance measures the similarity between two sets of images, by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples. Let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the Fréchet Inception Distance between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017). ) These results confirm that the sampled images from WAE are of better quality than from VAE (score: 82), and WAE-GAN gets a slightly better score (score:42) than WAE-MMD (score:55), which correlates with visual inspection of the images.<br />
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|right|Qualitative Assessment of Images]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge&diff=34402Label-Free Supervision of Neural Networks with Physics and Domain Knowledge2018-03-16T01:57:12Z<p>Bapskiko: /* Conclusion and Critique */</p>
<hr />
<div>== Introduction ==<br />
Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (LeCun, Bengio, and Hinton 2015[1]). Nonetheless, humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.<br />
<br />
[[File:c433li-1.png]]<br />
<br />
Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:<br />
* a reduction in the amount of work spent labeling, <br />
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.<br />
<br />
== Problem Setup ==<br />
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via <br />
<br />
::<math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math><br />
<br />
where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler" functions (Occam's razor) to prevent overfitting the model on the training data.<br />
<br />
The focus is on the set of problems/domains where the problem is a complex environment having a complex representation of the output space, for example mapping an input image to the height of an object(since this leads to a complex output space) rather than simple binary classification problem.<br />
<br />
In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>. <br />
<br />
Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.<br />
::<math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math><br />
<br />
If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.<br />
<br />
== Experiments ==<br />
=== Tracking an object in free fall ===<br />
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.<br />
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
<br />
The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:<br />
::<math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math><br />
<br />
where<br />
::<math><br />
\mathbf{A} = <br />
\left[ {\begin{array}{*{20}c}<br />
\Delta t & 1 \\<br />
2\Delta t & 1 \\<br />
3\Delta t & 1 \\<br />
\vdots & \vdots \\<br />
N\Delta t & 1 \\<br />
\end{array} } \right]<br />
</math><br />
<br />
The constraint loss is then defined as<br />
::<math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math><br />
<br />
Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).<br />
<br />
[[File:c433li-2.png]]<br />
<br />
The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.<br />
<br />
Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result is pretty satisfying.<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 12.1% || 94.5% || 90.1%<br />
|}<br />
<br />
=== Tracking the position of a walking man ===<br />
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.<br />
<br />
Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.<br />
<br />
::<math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math><br />
which seeks to maximize the standard deviation of the output, and<br />
<br />
::<math>\begin{split}<br />
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\<br />
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))<br />
\end{split}<br />
</math><br />
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:<br />
<br />
::<math><br />
\begin{split}<br />
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\<br />
& \gamma_1 * h_1(\mathbf{x}) <br />
\hphantom{\text{ }}+ \\<br />
& \gamma_2 * h_2(\mathbf{x})<br />
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\<br />
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))<br />
\end{split}<br />
</math><br />
<br />
[[File:c433li-3.png]]<br />
<br />
The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.<br />
<br />
As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:<br />
{| class="wikitable"<br />
|+ style="text-align: left;" | Evaluation <br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 45.9% || 80.5% || 95.4%<br />
|}<br />
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).<br />
<br />
=== Detecting objects with causal relationships ===<br />
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.<br />
<br />
[[File:paper18_Experiment_3.png|400px]]<br />
<br />
Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:<br />
<br />
::<math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math><br />
<br />
which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects, <br />
<br />
::<math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math><br />
<br />
which seeks high variance outputs, and<br />
<br />
::<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\<br />
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math><br />
<br />
which seeks high entropy outputs. The final loss function then becomes: <br />
<br />
::<math> \begin{split}<br />
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\<br />
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) + <br />
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)<br />
\end{split}<br />
</math><br />
<br />
'''Evaluation'''<br />
<br />
The input images, shown in Fig. (4), are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.<br />
<br />
== Conclusion and Critique ==<br />
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]).<br />
::<math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math><br />
<br />
<br />
Also, the paper has a mistake where they quote the free fall equation as<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math><br />
which should be<br />
::<math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math><br />
Although in this case it doesn't affect the result.<br />
<br />
<br />
For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two losses.<br />
<br />
In regards to the free fall experiment in particular, the authors apply a fixed acceleration model to create the constraint loss, with the goal of having the network predict height. However, since they did not measure the true height of the object to create test labels, they evaluate using height in pixel space. They do not mention the accuracy of their camera calibration, nor what camera model was used to remove lens distortion. Since lens distortion tends to be worse at the extreme edges of the image, and that they tossed the pillow throughout the entire frame, it is likely that the ground truth labels were corrupted by distortion. If that is the case, it is possible the supervised network is actually performing worse, because it learning how to predict distorted (beyond a constant scaling factor) heights instead of the true height.<br />
<br />
These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.<br />
<br />
Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.<br />
<br />
== References ==<br />
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.<br />
<br />
[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=33129stat946w18/Spectral normalization for generative adversial network2018-03-12T01:09:28Z<p>Bapskiko: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). One such discriminator is produced in this situation, the training of the generator comes to complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.<br />
<br />
In this paper, we propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. Our normalization enjoys following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational needed to implement spectral normalization is small <br />
<br />
In this study, we provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=W^1,\cdots,W^L, W^{L+1} </math> is the learning parameters set, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function.The final output of the discriminator function is given by <math>D(x,\theta) = A(f(x,\theta)) </math>. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math> where min and max of G and D are taken over the set of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
We search for the discriminator D from the set of K-lipshitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le k}V(G,D)\]<br />
<br />
where we mean by <math> ||f||_{lip}</math> the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just like what weight normalization does, we reparameterize weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix the singular value of weight matrix. Now we can calculate the gradient of new parameter W by chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically it is equivalent to require the weight by the weight normalization <math> \bar{W_{WN}} </math>:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{where } T=\min(d_i,d_0) </math> where <math> \sigma_t(A) </math> is a t-th singular value of matrix A. <br />
<br />
Note, if <math> \bar{W_{WN}} </math> is the weight normalized matrix of dimension <math> d_i*d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2 \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math> which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much norm of the input as possible and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at hte cost of reducing the rank and hence the number of features to be used for the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features to be used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at select few features. <br />
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization are mathematically quite different from our spectral normalization because the orthonormal regularization destroys the information about the spectrum by setting all the singular values to one. On the other hand, spectral normalization only scales the spectrum so that its maximum will be one. <br />
<br />
Gulrajani et al. (2017) used gradient penalty method in combination with WGAN. In their work, they placed K-Lipschitz constant on the discriminator by augmenting the objective function with the regularizer that rewards the function for having local 1-Lipschitz constant(i.e <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math> generated by interpolating a sample <math> \tilde{x} </math> from generative distribution and a sample <math> x </math> from the data distribution. This approach has an obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computational cost than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D, for the updates of G, use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, test performance of the algorithm with so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x}||-1)^2)]</math><br />
<br />
== Optimization ==<br />
Adam optimizer: 6 settings in total, related to <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of Adam. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above image show the inception core and FID score of with settings A-F, and table show the inception scores of the different methods with optimal settings on CIFAR-10 and STL-10 dataset.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As we predicted before, the singular values of the first fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization is more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs is slightly slower than weight normalization, but significantly faster than WGAN-GP. As we mentioned in Section3, WGAN-GP is slower than other methods because WGAN-GP needs to calculate the gradient of gradient norm.<br />
<br />
== comparison between GN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
Above we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that shall be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer for which the training with our spectral normalization prefers relatively small feature space. Above figure shows the result of our experiments. As we predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, does not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class conditional GANs on ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GAN without normalization and GAN with layer normalization collapsed in the beginning of training and failed to produce any meaningful images. Above picture shows that the inception score of the orthonormal normalization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of matrix <math> W </math> to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{for} l=1,\cdots,L </math> with a random vector (sampled from isotropic distribution) <br />
* For each update and each layer l:<br />
* Apply power iteration method to a unnormalized weight <math> W^l </math>:<br />
<br />
<math> \tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2 </math><br />
<br />
<math>\tilde{u_l}\leftarrow (W^l)^T\tilde{v_l}/||(W^l)^T\tilde{v_l}|| </math><br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
<math> \bar{W_{SN}}(W^l)=W^l/\sigma(W^l), \text{where} \sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l} </math><br />
<br />
* Update <math> W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<math> W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M) </math><br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer of training of GANs. When we apply spectral normalization to the GANs on image generation tasks, the generated examples are more diverse than the conventional weight normalization and achieve better or comparative inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to local regularization introduced by WGAN-GP, and can possibly used in combinations. In the future work, we would like to further investigate where our methods stand amongest other methods on more theoretical basis, and experiment our algorithm on larger and more complex datasets.<br />
<br />
== Critique(to be edited) ==<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only&diff=33124stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only2018-03-11T19:36:17Z<p>Bapskiko: /* Validation */</p>
<hr />
<div><br />
[[File:MC_Translation_Example.png]]<br />
== Introduction ==<br />
Neural machine translation systems are usually trained on large corpora consisting of pairs of pre-translated sentences. The paper ''Unsupervised Machine Translation Using Monolingual Corpora Only'' by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system, which can be trained without such parallel data.<br />
<br />
==Motivation==<br />
The authors offer two motivations for their work:<br />
# To translate between languages for which large parallel corpora does not exist<br />
# To provide a strong lower bound that any semi-supervised machine translation system is supposed to yield<br />
<br />
== Overview of unsupervised translation system ==<br />
The unsupervised translation scheme has the following outline:<br />
* The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.<br />
* Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.<br />
* A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.<br />
* An adversarial loss encourages the latent-space representations of source and target sentences to be indistinguishable from each other. It is intended that the latent-space representation of a sentence should reflect its meaning, and not the particular language in which it is expressed.<br />
* A reconstruction loss encourages the model to improve on the translation model of the previous epoch.<br />
<br />
[[File:paper4_fig1.png|frame|none|alt=Alt text|A toy example of illustrating the training process which guides the design of the objective function. The key idea here is to build a common latent space between languages. On the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language. On the right, the model is trained to reconstruct a sentence given the same sentence but in another language.]]<br />
<br />
==Notation==<br />
Let <math display="inline">S</math> denote the set of words in the source language, and let <math display="inline">T</math> denote the set of words in the target language. Let <math display="inline">H \subset \mathbb{R}^{n_H}</math> denote the latent vector space. Moreover, let <math display="inline">S'</math> and <math display="inline">T'</math> denote the sets of finite sequences of words in the source and target language, and let <math display="inline">H'</math> denote the set of finite sequences of vectors in the latent space. For any set X, elide measure-theoretic details and let <math display="inline">\mathcal{P}(X)</math> denote the set of probability distributions over X.<br />
<br />
==Word vector alignment ==<br />
<br />
Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps words with related meanings to nearby vectors, regardless of the language of the words. Moreover, if two words are one another's literal translations, their word vectors tend to be mutual nearest neighbors. <br />
<br />
The underlying idea of the alignment scheme can be summarized as follows: methods like word2vec or GLoVe generate vectors for which there is a correspondence between semantics and geometry. If <math display="inline">f</math> maps English words to their corresponding vectors, we have the approximate equation<br />
\begin{align}<br />
f(\text{king}) -f(\text{man}) +f(\text{woman})\approx f(\text{queen}).<br />
\end{align}<br />
Furthermore, if <math display="inline">g</math> maps French words to their corresponding vectors, then <br />
\begin{align}<br />
g(\text{roi}) -g(\text{homme}) +g(\text{femme})\approx g(\text{reine}).<br />
\end{align}<br />
<br />
Thus if <math display="inline">W</math> maps the word vectors of English words to the word vectors of their French translations, we should expect <math display="inline">W</math> to be linear. As was observed by Mikolov et al. (2013), the problem of word-vector alignment then becomes a problem of learning the linear transformation that best aligns two point clouds, one from the source language and one from the target language. For more on the history of the word-vector alignment problem, see my CS698 project ([https://uwaterloo.ca/scholar/sites/ca.scholar/files/pa2forsy/files/project_dec_3_0.pdf link]).<br />
<br />
Conneau et al. (2017)'s word vector alignment scheme is unique in that it requires no parallel data, and uses only the shapes of the two word-vector point clouds to be aligned. I will not go into detail, but the heart of the method is a special GAN, in which only the discriminator is a neural network, and the generator is the map corresponding to an orthogonal matrix.<br />
<br />
This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on we denote by <br />
<math display="inline">A: S' \cup T' \to \mathcal{Z}'</math> the function that maps a source- or target- language word sequence to the corresponding aligned word vector sequence.<br />
<br />
==Encoder ==<br />
The encoder <math display="inline">E </math> reads a sequence of word vectors <math display="inline">(z_1,\ldots, z_m) \in \mathcal{Z}'</math> and outputs a sequence of hidden states <math display="inline">(h_1,\ldots, h_m) \in H'</math> in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence <math display="inline">x=(x_1,\ldots, x_M)\in S'</math> to the latent space, we compute <math display="inline">E(A(x))</math>, and to map a target sentence <math display="inline">y=(y_1,\ldots, y_K)\in T'</math> to the latent space, we compute <math display="inline">E(A(y))</math>.<br />
<br />
The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTMs at each word vector.<br />
<br />
==Decoder==<br />
<br />
The decoder is a mono-directional LSTM that accepts a sequence of hidden states <math display="inline">h=(h_1,\ldots, h_m) \in H'</math> from the latent space and a language <math display="inline">L \in \{S,T \}</math> and outputs a probability distribution over sentences in that language. We have<br />
<br />
\begin{align}<br />
D: H' \times \{S,T \} \to \mathcal{P}(S') \cup \mathcal{P}(T').<br />
\end{align}<br />
<br />
The decoder makes use of the attention mechanism of Bahdanau et al. (2014). To compute the probability of a given sentence <math display="inline">y=(y_1,\ldots,y_K)</math> , the LSTM processes the sentence one word at a time, accepting at step <math display="inline">k</math> the aligned word vector of the previous word in the sentence <math display="inline">A(y_{k-1})</math> and a context vector <math display="inline">c_k\in H</math> computed from the hidden sequence <math display="inline">h\in H'</math>, and outputting a probability distribution over possible next words. The LSTM is initiated with a special, language-specific start-of-sequence token. Otherwise, the decoder is does not depend on the language of the sentence it is producing. The context vector is computed as described by Bahdanau et al. (2014), where we let <math display="inline">l_{k}</math> denote the hidden state of the LSTM at step <math display="inline">k</math>, and where <math display="inline">U,W</math> are learnable weight matrices, and <math display="inline">v</math> is a learnable weight vector:<br />
\begin{align}<br />
c_k&= \sum_{m=1}^M \alpha_{k,m} h_m\\<br />
\alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\<br />
e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ).<br />
\end{align}<br />
<br />
<br />
By learning <math display="inline">U,W</math> and <math display="inline">v</math>, the decoder can learn to decide which vectors in the sequence <math display="inline">h</math> are relevant to computing which words in the output sentence.<br />
<br />
At step <math display="inline">k</math>, after receiving the context vector <math display="inline">c_k\in H</math> and the aligned word vector of the previous word in the sequence,<math display="inline">A(y_{k-1})</math>, the LSTM outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.<br />
<br />
[[File:paper4_fig2.png|700px|]]<br />
<br />
==Overview of objective ==<br />
The objective function is the sum of:<br />
# The de-noising auto-encoder loss,<br />
# The translation loss,<br />
# The adversarial loss.<br />
I shall describe these in the following sections.<br />
<br />
==De-noising Auto-encoder Loss == <br />
A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset to the original un-corrupted sample. De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating. If we think of the dataset of interest as a thin manifold in a high-dimensional space, the corruption process is likely perturbed a datapoint off the manifold. To learn to restore the corrupted datapoint, the de-noising auto-encoder must learn the shape of the manifold.<br />
<br />
Hill et al. (2016), used a de-noising auto-encoder to learn vectors representing sentences. They corrupted input sentences by randomly dropping and swapping words, and then trained a neural network to map the corrupted sentence to a vector, and then map the vector to the un-corrupted sentence. Interestingly, they found that sentence vectors learned this way were particularly effective when applied to tasks that involved generating paraphrases. This makes some sense: for a vector to be useful in restoring a corrupted sentence, it must capture something of the sentence's underlying meaning.<br />
<br />
The present paper uses the principal of de-noising auto-encoders to compute one of the terms in its loss function. In each iteration, a sentence is sampled from the source or target language, and a corruption process <math display="inline"> C</math> is applied to it. <math display="inline"> C</math> works by deleting each word in the sentence with probability <math display="inline">p_C</math> and applying to the sentence a permutation randomly selected from those that do not move words more than <math display="inline">k_C</math> spots from their original positions. The authors select <math display="inline">p_C=0.1</math> and <math display="inline">k_C=3</math>. The corrupted sentence is then mapped to the latent space using <math display="inline">E\circ A</math>. The loss is then the negative log probability of the original un-corrupted sentence according to the decoder <math display="inline">D</math> applied to the latent-space sequence.<br />
<br />
The explanation of Vincent et al. (2008) can help us understand this loss-function term: the de-noising auto-encoder loss forces the translation system to learn the shapes of the manifolds of the source and target languages.<br />
<br />
==Translation Loss==<br />
To compute the translation loss, we sample a sentence from one of the languages, translate it with the encoder and decoder of the previous epoch, and then corrupt its output with <math display="inline">C</math>. We then use the current encoder <math display="inline">E</math> to map the corrupted translation to a sequence <math display="inline">h \in H'</math> and the decoder <math display="inline">D</math> to map <math display="inline">h</math> to a probability distribution over sentences. The translation loss is the negative log probability the decoder assigns to the original uncorrupted sentence. <br />
<br />
It is interesting and useful to consider why this translation loss, which depends on the translation model of the previous iteration, should promote an improved translation model in the current iteration. One loose way to understand this is to think of the translator as a de-noising translator. We are given a sentence perturbed from the manifold of possible sentences from a given language both by the corruption process and by the poor quality of the translation. The model must learn to both project and translate. The technique employed here resembles that used by Sennrich et al. (2014), who trained a neural machine translation system using both parallel and monolingual data. To make use of the monolingual target-language data, they used an auxiliary model to translate it to the source language, then trained their model to reconstruct the original target-language data from the source-language translation. Sennrich et al. argued that training the model to reconstruct true data from synthetic data was more robust than the opposite approach. The authors of the present paper use similar reasoning.<br />
<br />
==Adversarial Loss ==<br />
The intuition underlying the latent space is that it should encode the meaning of a sentence in a language-independent way. Accordingly, the authors introduce an adversarial loss, to encourage latent-space vectors mapped from the source and target languages to be indistinguishable. Central to this adversarial loss is the discriminator <math display="inline">R:H' \to [0,1]</math>, which makes use of <math display="inline">r: H\to [0,1]</math> a three-layer fully-connected neural network with 1024 hidden units per layer. Given a sequence of latent-space vectors <math display="inline">h=(h_1,\ldots,h_m)\in H'</math> the discriminator assigns probability <math display="inline">R(h)=\prod_{i=1}^m r(h_i)</math> that they originated in the target space. Each iteration, the discriminator is trained to maximize the objective function<br />
<br />
\begin{align}<br />
I_T(q) \log (R(E(q))) +(1-I_T(q) )\log(1-R(E(q)))<br />
\end{align}<br />
<br />
where <math display="inline">q</math> is a randomly selected sentence, and <math display="inline">I_T(q)</math> is 1 when <math display="inline">q\in I_T</math> is from the source language and 0 if <math display="inline">q\in I_S</math><br />
<br />
The same term is added to the primary objective function, which the encoder and decoder are trained to minimize. The result is that the encoder and decoder learn to fool the discriminator by mapping sentences from the source and target language to similar sequences of latent-space vectors.<br />
<br />
<br />
The authors note that they make use of label smoothing, a technique recommended by Goodfellow (2016) for regularizing GANs, in which the objective described above is replaced by <br />
<br />
\begin{align}<br />
I_T(q)( (1-\alpha)\log (R(E(q))) +\alpha\log(1-R(E(q))) )+(1-I_T(q) ) ( (1-\beta) \log(1-R(E(q))) +\beta\log (R(E(q)) ))<br />
\end{align}<br />
for some small nonnegative values of <math display="inline">\alpha, \beta</math>, the idea being to prevent the discriminator from making extreme predictions. While one-sided label smoothing (<math display="inline">\beta = 0</math>) is generally recommended, the present model differs from a standard GAN in that it is symmetric, and hence two-sided label smoothing would appear more reasonable.<br />
<br />
<br />
It is interesting to observe that while the intuition justifying the use of the latent space suggests that the latent space representation of a sentence should be language-independent, this is not actually true: if two sentences are translations of one another, but have different lengths, their latent-space representations will necessarily be different, since a a sentence's latent space representation has the same length as the sentence itself.<br />
<br />
==Objective Function==<br />
<br />
Combining the above-described terms, we can write the overall objective function. Let <math display="inline">Q_S</math> denote the monolingual dataset for the source language, and let <math display="inline">Q_T</math> denote the monolingual dataset for the target language. Let <math display="inline">D_S:= D(\cdot, S)</math> and<math display="inline">D_T= D(\cdot, T)</math> (i.e. <math display="inline">D_S, D_T</math>) be the decoder restricted to the source or target language, respectively. Let <math display="inline"> M_S </math> and <math display="inline"> M_T </math> denote the target-to-source and source-to-target translation models of the previous epoch. Then our objective function is<br />
<br />
\begin{align}<br />
\mathcal{L}(D,E,R)=\text{T Translation Loss}+\text{T De-noising Loss} +\text{T Adversarial Loss} +\text{S Translation Loss} +\text{S De-noising Loss} +\text{S Adversarial Loss}\\<br />
\end{align}<br />
\begin{align}<br />
=\sum_{q\in Q_T}\left( -\log D_T \circ E \circ C \circ M _S(q) (q) -\log D_T \circ E \circ C (q) (q)+(1-\alpha)\log (R\circ E(q)) +\alpha\log(1-R\circ E(q)) \right)+\sum_{q\in Q_S}\left( -\log D_S \circ E \circ C \circ M_T (q) (q) -\log D_S \circ E \circ C (q) (q)+(1-\beta) \log(1-R \circ E(q)) +\beta\log (R\circ E(q) \right).<br />
\end{align}<br />
<br />
They alternate between iterations minimizing <math display="inline">\mathcal{L} </math> with respect to <math display="inline">E, D</math> and iterations maximizing with respect to <math display="inline">R</math>. ADAM is used for minimization, while RMSprop is used for maximization. After each epoch, M is updated so that <math display="inline">M_S=D_S \circ E</math> and <math display="inline">M_T=D_T \circ E</math>, after which <math display="inline"> M </math> is frozen until the next epoch.<br />
<br />
==Validation==<br />
The authors' aim is for their method to be completely unsupervised, so they do not use parallel corpora even for the selection of hyper-parameters. Instead, they validate by translating sentences to the other language and back, and comparing the resulting sentence with the original according to BLEU, a similarity metric frequently used in translation (Papineni et al. 2002).<br />
<br />
As justification, they show empirically that the score generated by applying BLEU on back-and-forth translation is correlated with applying BLEU using parallel corpora.<br />
[[File:paper4fig3.png]]<br />
<br />
==Experimental Procedure and Results==<br />
<br />
The authors test their method on four data sets. The first is from the English-French translation task of the Workshop on Machine Translation 2014 (WMT14). This data set consists of parallel data. The authors generate a monolingual English corpus by randomly sampling 15 million sentence pairs, and choosing only the English sentences. They then generate a French corpus by selecting the French sentences from those pairs that were not previous chosen. Importantly, this means that the monolingual data sets have no parallel sentences. The second data set is generated from the English-German translation task from WMT14 using the same procedure.<br />
<br />
The third and fourth data sets are generated from Multi30k data set, which consists of multilingual captions of various images. The images are discarded and the English, French, and German captions are used to generate monolingual data sets in the manner described above. These monolingual corpora are much smaller, consisting of 14500 sentences each.<br />
<br />
The unsupervised translation scheme performs well, though not as well as a supervised translation scheme. It converges after a small number of epochs. Besides supervised translation, the authors compare their method with three other baselines: "Word-by-Word" uses only the previously-discussed word-alignment scheme; "Word-Reordering" uses a simple LSTM based language model and a greedy algorithm to select a reordering of the words produced by "Word-by-Word". "Oracle Word Reordering" means the optimal reordering of the words produced by "Word-by-Word".<br />
<br />
==Result Figures==<br />
[[File:MC_Translation Results.png]]<br />
[[File:MC_Translation_Convergence.png]]<br />
<br />
==Commentary==<br />
This paper's results are impressive: that it is even possible to translate between languages without parallel data suggests that languages are more similar than we might initially suspect, and that the method the authors present has, at least in part, discovered some common deep structure. As the authors point out, using no parallel data at all, their method is able to produce results comparable to those produced by neural machine translation methods trained on hundreds of thousands of a parallel sentences on the WMT dataset. On the other hand, the results they offer come with a few significant caveats.<br />
<br />
The first caveat is that the workhorse of the method is the unsupervised word-vector alignment scheme presented in Conneau et al. (2017) (that paper shares 3 authors with this one). As the ablation study reveals, without word-vector alignment, this method preforms extremely poorly. Moreover, word-by-word translation using word-vector alignment alone performs well, albeit not as well as this method. This suggests that the method of this paper mainly learns to perform (sometimes significant) corrections to word-by-word translations by reordering and occasional word substitution. Presumably, it does this by learning something of the natural structure of sentences in each of the two languages, so that it can correct the errors made by word-by-word translation.<br />
<br />
The second caveat is that the best results are attained translating between English and French, two very closely related languages, and the quality of translation between English and German, a slightly-less related pair, is significantly worse ( according to the ''Shorter Oxford English Dictionary'', 28.3 percent of the English vocabulary is French-derived, 28.2 percent is Latin-derived, and 25 percent is derived from Germanic languages. This probably understates the degree of correspondence between the French and English vocabularies, since French likely derives from Latin many of the same words English does. ). The authors do not report results with more distantly-related pairs, but it is reasonable to expect that performance would degrade significantly, for two reasons. Firstly, Conneau et al. (2017) shows that the word-alignment scheme performs much worse on more distant language pairs. This may be because there are more one-to-one correspondences between the words of closely related languages than there are between more distant languages. Secondly, because the same encoder is used to read sentences of both languages, the encoder cannot adapt to the unique word-order properties of either language. This would become a problem for language pairs with very different grammar. The authors suggest that their scheme could be a useful tool for translating between language pairs for which their are few parallel corpora. However, language pairs lacking parallel corpora are often (though not always) distantly related, and it is for such pairs that the performance of the present method likely suffers.<br />
<br />
<br />
<br />
<br />
The proposed method always beats Oracle Word Reordering on the Multi30k data set, but sometimes does not on the WMT data set. This may be because the WMT sentences are much more syntactically complex than the simple image captions of the Multi30k data set.<br />
<br />
The ablation study also reveals the importance of the corruption process <math display="inline">C</math>: the absence of <math display="inline">C</math> significantly degrades translation quality, though not as much as the absence of word-vector alignment. We can understand this in two related ways. First of all, if we view the model as learning to correct structural errors in word-by-word translations, then the corruption process introduces more errors of this kind, and so provides additional data upon which the model can train. Second, as Vincent et al. (2008) point out, de-noising auto-encoder training encourages a model to learn the structure of the manifold from which the data is drawn. By learning the structure of the source and target languages, the model can better correct the errors of word-by-word translation.<br />
<br />
[[File:MC_Alignment_Results.png|frame|none|alt=Alt text|From Conneau et al. (2017). The final row shows the performance of alignment method used in the present paper. Note the degradation in performance for more distant languages.]]<br />
<br />
[[File:MC_Translation_Ablation.png|frame|none|alt=Alt text|From the present paper. Results of an ablation study. Of note are the first, third, and forth rows, which demonstrate that while the translation component of the loss is relatively unimportant, the word vector alignment scheme and de-noising auto-encoder matter a great deal.]]<br />
<br />
==Future Work==<br />
The principal of performing unsupervised translation by starting with a rough but reasonable guess, and then improving it using knowledge of the structure of target language seems promising. Word by word translation using word-vector alignment works well for closely related languages like English and French, but is unlikely to work as well for more distant languages. For those languages, a better method for getting an initial guess is required.<br />
<br />
==References==<br />
#Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
#Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. "Word Translation without Parallel Data". arXiv:1710.04087, (2017)<br />
# Dictionary, Shorter Oxford English. "Shorter Oxford english dictionary." (2007).<br />
#Goodfellow, Ian. "NIPS 2016 tutorial: Generative adversarial networks." arXiv preprint arXiv:1701.00160 (2016).<br />
# Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).<br />
# Lample, Guillaume, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised Machine Translation Using Monolingual Corpora Only." arXiv preprint arXiv:1711.00043 (2017).<br />
#Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.<br />
# Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168. (2013).<br />
#Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." arXiv preprint arXiv:1511.06709 (2015).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.<br />
# Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper4fig3.png&diff=33123File:paper4fig3.png2018-03-11T19:31:54Z<p>Bapskiko: Comparison of validation metric compared to parallel metric</p>
<hr />
<div>Comparison of validation metric compared to parallel metric</div>Bapskikohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18&diff=31849stat946w182018-02-13T18:46:23Z<p>Bapskiko: </p>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1fU746Cld_mSqQBCD5qadvkXZW1g-j-kHvmHQ6AMeuqU/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
[https://docs.google.com/forms/d/1HrpW_lnn4jpFmoYKJBRAkm-GYa8djv9iZXcESeVB7Ts/prefill Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Feb 27 || Peter Forsyth || 1|| Unsupervised Machine Translation Using Monolingual Corpora Only || [https://arxiv.org/pdf/1711.00043.pdf Paper] || <br />
|-<br />
|Feb 27 || || 2|| || || <br />
|-<br />
|Feb 27 || || 3|| || || <br />
|-<br />
|Mar 1 || || 4|| || || <br />
|-<br />
|Mar 1 || || 5|| || || <br />
|-<br />
|Mar 1 || || 6|| || || <br />
|-<br />
|Mar 6 || || 3|| || || <br />
|-<br />
|Mar 6 || || 3|| || || <br />
|-<br />
|Mar 6 || || 3|| || || <br />
|-<br />
|Mar 8 || Alex (Xian) Wang || 4 || Self-Normalizing Neural Networks || [http://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf Paper] || <br />
|-<br />
|Mar 8 || || 4|| || || <br />
|-<br />
|Mar 8 || || 4|| || || <br />
|-<br />
|Mar 13 || Chunshang Li || 5|| UNDERSTANDING IMAGE MOTION WITH GROUP REPRESENTATIONS || || <br />
|-<br />
|Mar 13 || || 5|| || || <br />
|-<br />
|Mar 13 || || 5|| || || <br />
|-<br />
|Mar 15 || || 6|| || || <br />
|-<br />
|Mar 15 || || 6|| || || <br />
|-<br />
|Mar 15 || || 6|| || || <br />
|-<br />
|Mar 20 || Travis Dunn || 7|| Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments || [https://openreview.net/pdf?id=Sk2u1g-0- Paper] || [Summary]<br />
|-<br />
|Mar 20 || Sushrut Bhalla || 20|| MaskRNN: Instance Level Video Object Segmentation || [https://papers.nips.cc/paper/6636-maskrnn-instance-level-video-object-segmentation.pdf Paper] || [Summary]<br />
|-<br />
|Mar 20 || Hamid Tahir || 21|| Wavelet Pooling for Convolution Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || Summary<br />
|-<br />
|Mar 22 || Dongyang Yang|| 8|| DON’T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE || [https://arxiv.org/pdf/1711.00489.pdf Paper] || <br />
|-<br />
|Mar 22 || Yao Li || 8||Toward Multimodal Image-to-Image Translation || [http://papers.nips.cc/paper/6650-toward-multimodal-image-to-image-translation.pdf Paper] || <br />
|-<br />
|Mar 22 || Sahil Pereira || 24||End-to-End Differentiable Adversarial Imitation Learning|| [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Paper] || [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Summary]<br />
|-<br />
|Mar 27 || Jaspreet Singh Sambee || 25|| Gated Recurrent Convolution Neural Network for OCR || [http://papers.nips.cc/paper/6637-gated-recurrent-convolution-neural-network-for-ocr.pdf Paper] || <br />
|-<br />
|Mar 27 || Braden Hurl || 26|| Spherical CNNs || [https://openreview.net/pdf?id=Hkbd5xZRb Paper] || <br />
|-<br />
|Mar 27 || Marko Ilievski || 27|| Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders || [http://proceedings.mlr.press/v70/engel17a/engel17a.pdf Paper] || <br />
|-<br />
|Mar 29 || Alex Pon || 28||Wasserstein GAN || [https://arxiv.org/pdf/1701.07875.pdf Paper] ||<br />
|-<br />
|Mar 29 || Sean Walsh || 29||Improved Training of Wasserstein GANs || [https://arxiv.org/pdf/1704.00028.pdf Paper] ||<br />
|-<br />
|Mar 29 || Jason Ku || 30||MarrNet: 3D Shape Reconstruction via 2.5D Sketches ||[https://arxiv.org/pdf/1711.03129.pdf Paper] ||<br />
|-<br />
|Apr 3 || Tong Yang || 31|| Dynamic Routing Between Capsules. || [http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf Paper] || <br />
|-<br />
|Apr 3 || Benjamin Skikos || 32|| Training and Inference with Integers in Deep Neural Networks || [https://openreview.net/pdf?id=HJGXzmspb Paper] || <br />
|-<br />
|Apr 3 || || 11|| || || <br />
|-</div>Bapskiko