statwiki - User contributions [US]

Neural Audio Synthesis of Musical Notes with WaveNet autoencoders

2018-03-27T19:03:48Z

Apon:

= Introduction =
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet [5] and SampleRNN [6]. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.

= Contributions =
To solve the problem highlighted above the authors propose two main contributions of their paper:
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets

= Models =

[[File:paper26-figure1-models.png|center]]

== WaveNet Autoencoder ==

While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:
* It is able to attain consistent long-term structure without any external conditioning
* Creating meaningful embedding which can be interpolated between
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution:

\begin{align}
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}
\end{align}

A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings).
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.

== Baseline: Spectral Autoencoder ==
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder.

The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies.

== Training ==
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.

= The NSynth Dataset =
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.

The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.

Along with each note the authors also included the following annotations:
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}
* Qualities - Sonic qualities about each note

The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.

[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]

= Evaluation =

To fully analyze all aspects of WaveNet the authors proposed three evaluations:
* Reconstruction - Both Quantitative and Qualitative analysis were considered
* Interpolation in Timbre and Dynamics
* Entanglement of Pitch and Timbre

Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.

However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.

== Reconstruction ==

[[File:paper27-figure2-reconstruction.png|center]]

=== Qualitative Comparison ===
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.

[[File:paper27-table1.png|center]]

Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.

=== Quantitative Comparison ===
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.

== Interpolation in Timbre and Dynamics ==

[[File:paper27-figure3-interpolation.png|center]]

For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results.
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.

== Entanglement of Pitch and Timbre ==

[[File:paper27-table2.png|center]]

[[File:paper27-figure4-entanglement.png|center]]

To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.

Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.

= Future Directions =

One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. They claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.

= Open Source Code =

Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth

= References =

# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).

File:nsynth table.png

2018-03-27T19:01:21Z

Apon:

Neural Audio Synthesis of Musical Notes with WaveNet autoencoders

2018-03-27T18:58:11Z

Apon: /* The NSynth Dataset */

= Introduction =
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like imagenet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet [5] and SampleRNN [6]. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.

= Contributions =
To solve the problem highlighted above the authors propose two main contributions of their paper:
* Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning
* NSynth: a large dataset of musical notes inspired by the emerging of large image datasets

= Models =

[[File:paper26-figure1-models.png|center]]

== WaveNet Autoencoder ==

While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:
* It is able to attain consistent long-term structure without any external conditioning
* Creating meaningful embedding which can be interpolated between
The authors accomplish this by passing the raw audio throw the encoder to produce an embedding <math>Z = f(x) </math>, next the input is shifted and feed into the decoder which reproduces the input. The resulting probability distribution:

\begin{align}
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}
\end{align}

A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings).
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.

== Baseline: Spectral Autoencoder ==
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder.

The authors attempted to train the baseline on multiple input: raw waveforms, FFT, and log magnitude of spectrum finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies.

== Training ==
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6.

= The NSynth Dataset =
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003) and the dataset from Romani Picas et al. However, the authors wanted to develop a larger dataset.

The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.

Along with each note the authors also included the following annotations:
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’
* Family - The family class of instruments that produced each note. There is 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}
* Qualities - Sonic qualities about each note

The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.

= Evaluation =

To fully analyze all aspects of WaveNet the authors proposed three evaluations:
* Reconstruction - Both Quantitative and Qualitative analysis were considered
* Interpolation in Timbre and Dynamics
* Entanglement of Pitch and Timbre

Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analyses as two very different sound can look quite similar in the respective pictorial representation. This is why the authors recommend all readers to listen to the created notes which can be sound here: https://magenta.tensorflow.org/nsynth.

However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.

== Reconstruction ==

[[File:paper27-figure2-reconstruction.png|center]]

=== Qualitative Comparison ===
In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in figure 2), and the attack (B in figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2) however they argue that they are inaudible.

[[File:paper27-table1.png|center]]

Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.

=== Quantitative Comparison ===
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.

== Interpolation in Timbre and Dynamics ==

[[File:paper27-figure3-interpolation.png|center]]

For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results.
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.

== Entanglement of Pitch and Timbre ==

[[File:paper27-table2.png|center]]

[[File:paper27-figure4-entanglement.png|center]]

To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.

Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.

= Future Directions =

One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. They claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.

= Open Source Code =

Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth

= References =

# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).

Multi-scale Dense Networks for Resource Efficient Image Classification

2018-03-26T19:29:51Z

Apon: /* Architecture */

= Introduction =

Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:
efficient networks, but don't do well on hard examples, or large networks that do well on all examples but require a large amount of resources.

In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.

= Related Networks =

== Computationally Efficient Networks ==

Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.

== Resource Efficient Networks ==

Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.
Examples of work in this area include:
* Efficient variants to existing state of the art networks
* Gradient boosted decision trees, which incorporate computational limitations into the training
* Fractal nets
* Adaptive computation time method

== Related architectures ==

MSDNets pull on concepts from a number of existing networks:
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.

= Problem Setup =
The authors consider two settings that impose computational constraints at prediction time.

== Anytime Prediction ==
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. The budget is nondeterministic and varies per test instance.

== Budgeted Batch Classification ==
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_test = {x_1, . . . , x_M}</math> within a finite computational budget <math>B > 0</math> that is known in advance.

= Multi-Scale Dense Networks =

== Integral Contributions ==

The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.

=== Coarse Level Features Needed For Classification ===

[[File:paper29 fig3.png | 700px|thumb|center]]

Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.

Figure 3 depicts relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically with the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on Cifar-100 dataset and it can be seen that the intermediate classifiers perform worst than the final classifiers, thus highlighting the problem with the lack of coarse level features early on.

To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.

=== Training of Early Classifiers Interferes with Later Classifiers ===

When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.

MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.

== Architecture ==

[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]

The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.

The first layer is a special, mini-CNN-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.

Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.

The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.

=== Loss Function ===

The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.

=== Computational Limit Inclusion ===

When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones.
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.

=== Network Reduction and Lazy Evaluation ===
There are two ways to reduce the computational needs of MSDNets:

# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping the <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block.
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.

= Experiments =

When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet.

When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.

== Anytime Prediction ==

In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.

[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]

== Budget Batch ==

For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.

[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]

= Critique =

The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.

= Implementation =
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet

= Sources =
# Huang, G., Chen, D., Li, T., Wu, F., Maaten, L., & Weinberger, K. Q. (n.d.). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. doi:1703.09844
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet

Multi-scale Dense Networks for Resource Efficient Image Classification

2018-03-24T19:55:36Z

Apon: /* Architecture */

= Introduction =

Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:
efficient networks, but don't do well on hard examples, or large networks that do well on all examples but require a large amount of resources.

In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.

= Related Networks =

== Computationally Efficient Networks ==

Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.

== Resource Efficient Networks ==

Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.
Examples of work in this area include:
* Efficient variants to existing state of the art networks
* Gradient boosted decision trees, which incorporate computational limitations into the training
* Fractal nets
* Adaptive computation time method

== Related architectures ==

MSDNets pull on concepts from a number of existing networks:
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.

= Multi-Scale Dense Networks =

== Integral Contributions ==

The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.

=== Coarse Level Features Needed For Classification ===

[[File:paper29 fig3.png | 700px|thumb|center]]

Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.

Figure 3 depicts relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically with the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on Cifar-100 dataset and it can be seen that the intermediate classifiers perform worst than the final classifiers, thus highlighting the problem with the lack of coarse level features early on.

To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.

=== Training of Early Classifiers Interferes with Later Classifiers ===

When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.

MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.

== Architecture ==

[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]

The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.

The first layer is a special, mini-CNN-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.

Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.

The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.

=== Loss Function ===

The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.

=== Computational Limit Inclusion ===

When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones.
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.

= Experiments =

When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet.

When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.

== Anytime Prediction ==

In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.

[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]

== Budget Batch ==

For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.

[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]

= Critique =

The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.

Multi-scale Dense Networks for Resource Efficient Image Classification

2018-03-24T19:55:09Z

Apon: /* Architecture */

= Introduction =

Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:
efficient networks, but don't do well on hard examples, or large networks that do well on all examples but require a large amount of resources.

In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.

= Related Networks =

== Computationally Efficient Networks ==

Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.

== Resource Efficient Networks ==

Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.
Examples of work in this area include:
* Efficient variants to existing state of the art networks
* Gradient boosted decision trees, which incorporate computational limitations into the training
* Fractal nets
* Adaptive computation time method

== Related architectures ==

MSDNets pull on concepts from a number of existing networks:
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.

= Multi-Scale Dense Networks =

== Integral Contributions ==

The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.

=== Coarse Level Features Needed For Classification ===

[[File:paper29 fig3.png | 700px|thumb|center]]

Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.

Figure 3 depicts relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically with the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on Cifar-100 dataset and it can be seen that the intermediate classifiers perform worst than the final classifiers, thus highlighting the problem with the lack of coarse level features early on.

To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.

=== Training of Early Classifiers Interferes with Later Classifiers ===

When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.

MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.

== Architecture ==

[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]

The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.

The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.

Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.

The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.

=== Loss Function ===

The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.

=== Computational Limit Inclusion ===

When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones.
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.

= Experiments =

When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet.

When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.

== Anytime Prediction ==

In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.

[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]

== Budget Batch ==

For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.

[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]

= Critique =

The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.

Multi-scale Dense Networks for Resource Efficient Image Classification

2018-03-24T19:54:33Z

Apon: /* Introduction */

= Introduction =

Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:
efficient networks, but don't do well on hard examples, or large networks that do well on all examples but require a large amount of resources.

In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:
Anytime Prediction: What is the best prediction the network can provide when suddenly prompted.
Budget Batch Predictions: Given a maximum amount of computational resources how well does the network do on the batch.

= Related Networks =

== Computationally Efficient Networks ==

Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains smaller network to match teacher network.

== Resource Efficient Networks ==

Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.
Examples of work in this area include:
* Efficient variants to existing state of the art networks
* Gradient boosted decision trees, which incorporate computational limitations into the training
* Fractal nets
* Adaptive computation time method

== Related architectures ==

MSDNets pull on concepts from a number of existing networks:
* Neural fabrics and others, are used to quickly establish a low resolution feature map, which is integral for classification.
* Deeply supervised nets, introduced the incorporation of multiple classifiers throughout the network
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.

= Multi-Scale Dense Networks =

== Integral Contributions ==

The way MSDNets aims to provide efficient classification with varying computational costs is to create one network that outputs results at depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.

=== Coarse Level Features Needed For Classification ===

[[File:paper29 fig3.png | 700px|thumb|center]]

Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.

Figure 3 depicts relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically with the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on Cifar-100 dataset and it can be seen that the intermediate classifiers perform worst than the final classifiers, thus highlighting the problem with the lack of coarse level features early on.

To address this issue, MSDNets proposes an architecture in which uses multi scaled feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse level features for classification and fine features for learning more difficult representations.

=== Training of Early Classifiers Interferes with Later Classifiers ===

When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.

MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.

== Architecture ==

[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]

The architecture of an MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.

The first layer is a special, mini-cnn-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.

Each output at a given s scale is given by the convolution of all prior outputs of the same scale, and the strided-convolution of all prior outputs from the previous scale.

The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.

=== Loss Function ===

The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is taken as an average over a set of training samples. The weights can be determined from a budget of computational power, but results also show that setting all to 1 is also acceptable.

=== Computational Limit Inclusion ===

When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones.
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must be true. Where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers have the same base probability, <math>q</math>, then <math>q_k</math> can be used to find the threshold.

= Experiments =

When evaluating on CIFAR-10 and CIFAR-100 ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet are used to compare with MSDNet.

When evaluating on ImageNet ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.

== Anytime Prediction ==

In anytime prediction MSDNets are shown to have highly accurate with very little budget, and continue to remain above the alternate methods as the budget increases.

[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]

== Budget Batch ==

For budget batch 3 MSDNets are designed with classifiers set-up for varying ranges of budget constraints. On both dataset options the MSDNets exceed all alternate methods with a fraction of the budget required.

[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]

= Critique =

The problem formulation and scenario evaluation were very well formulated, and according to independent reviews, the results were reproducible. Where the paper could improve is on explaining how to implement the threshold; it isn't very well explained how the use of the validation set can be used to set the threshold value.

MarrNet: 3D Shape Reconstruction via 2.5D Sketches

2018-03-22T19:36:04Z

Apon: /* Results */

= Introduction =
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.

[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]

In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforce the re projection consistency between the 3D shape and the estimated sketch. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the author’s design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework.

The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.

Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.

The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.

= Related Work =

== 2.5D Sketch Recovery ==
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.

[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]

== Single Image 3D Reconstruction ==
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.

== 2D-3D Consistency ==
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.

= Approach =
The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.

[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]

== 2.5D Sketch Estimation ==
The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.

== 3D Shape Estimation ==
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.

== Re-projection Consistency ==
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.

[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]

=== Depths ===
The voxel with depth <math>v_{x, y}, d_{x, y}</math> should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:

<math>
L_{depth}(x, y, z)=
\left\{
\begin{array}{ll}
v^2_{x, y, z}, & z < d_{x, y} \\
(1 - v_{x, y, z})^2, & z = d_{x, y} \\
0, & z > d_{x, y} \\
\end{array}
\right.
</math>

<math>
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =
\left\{
\begin{array}{ll}
2v{x, y, z}, & z < d_{x, y} \\
2(v_{x, y, z} - 1), & z = d_{x, y} \\
0, & z > d_{x, y} \\
\end{array}
\right.
</math>

When <math>d_{x, y} = \infty</math>, all voxels in front of it should be 0.

=== Surface Normals ===
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal tried to guarantee voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.

The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:

<math>
L_{normal}(x, y, z) =
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 +
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2
</math>

Gradients along x are:

<math>
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)
</math>
and
<math>
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)
</math>

Gradients along y are similar to x.

= Training =
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.

For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.

Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.

Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.

= Evaluation =
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets.

== ShapeNet ==
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random background from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.

=== Method ===
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.

=== Results ===
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. Quantitatively, the full model also achieves 0.57 IoU, higher than the direct prediction baseline.

[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]

== PASCAL 3D+ ==
Rough 3D models are provided from real-life images.

=== Method ===
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.

=== Results ===
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.

[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]

Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. However, the authors claim that the IoU metric is sub-optimal for three reasons. First, there is no emphasis on details since the metric prefers models that predict mean shapes consistently. Second, all possible scales are searched during the IoU computation, making it less efficient. Third, PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, and computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.

[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans prefered the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]

[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]

Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.

[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]

===

== IKEA ==
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.

=== Results ===
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studes show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.

[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]

== Other Data ==
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.

[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]

MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.

[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]

= Commentary =
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.

As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.

As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder.

= Conclusion =
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.

= References =
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
# JiajunWu, Chengkai Zhang, Tianfan Xue,William T Freeman, and Joshua B Tenenbaum. Learning a Proba- bilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.

End-to-End Differentiable Adversarial Imitation Learning

2018-03-22T19:26:28Z

Apon: /* Discussion */

= Introduction =
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.

To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminators decision boundary hyper-plane, as seen in Figure 1. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model free setup is the one where the agent cannot make predictions about what the next state and reward will be before it takes each action since the transition function to move from state A to state B is not learned.

The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. Training policy <math>\pi</math> assumes the existence of an expert policy <math>\pi_{E}</math> with given trajectories <math>\{s_{0},a_{0},s_{1},...\}^{N}_{i=0}</math> which it aims to imitate without access to the original reward signal <math>r_{e}</math>. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.

[[File:GeneratorFollowingDiscriminator.png|center]]

Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.

= Background =
== Markov Decision Process ==
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P :
S × A × S → [0, 1]</math> is the transition probability distribution, <math>r : (S × A) → R</math> is the reward function, <math>\rho_0 : S → [0, 1]</math> is the distribution over initial states, and <math>γ ∈ (0, 1)</math> is the discount factor. Let <math>π</math> denote a stochastic policy <math>π : S × A → [0, 1]</math>, <math>R(π)</math> denote its expected discounted reward: <math>E_πR = E_π [\sum_{t=0}^T \gamma^t r_t]</math> and <math>τ</math> denote a trajectory of states and actions <math>τ = {s_0, a_0, s_1, a_1, ...}</math>.

== Imitation Learning ==
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made by most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy.

This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued till the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following expert's policy at the start of training.

== Generative Adversarial Networks ==
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:

\begin{align}
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{x\sim p_E}[log(D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]
\end{align}

In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.

GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:

\begin{align}
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{\pi}[log(D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi))
\end{align}

where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.

This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log(D(s,a)] </math>.

The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]
\end{align}

where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:

\begin{align}
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]
\end{align}

REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.

= Algorithm =
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.

== The discriminator network ==
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in (\pi_E, \pi) </math>.

The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:

\begin{aligned}
D(s,a) &= p(\pi|s,a) \\
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\
\end{aligned}

Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:

\begin{aligned}
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}
\end{aligned}

to get the final expression for <math> D(s,a) </math>:
\begin{aligned}
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}
\end{aligned}

<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induces by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action a under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can by obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed to be used for training policies (see following sections):

\begin{aligned}
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\end{aligned}

== Backpropagating through stochastic units ==
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.

=== Continuous Action Distributions ===
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and variance are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, then the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(a,s) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}
\end{align}

=== Categorical Action Distributions ===
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:

\begin{align}
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).
\end{align}

Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick (Gumbel-softmax allows us to generate a differentiable sample from a discrete distribution, which is needed in this trajectory imitation setting.):

\begin{align}
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{k}exp[\frac{1}{\tau}(g_j + log\ \pi(a_i|s))]}
\end{align}

In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.

The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.

== Backpropagating through a Forward model ==
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:

\begin{align}
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )
\end{align}

Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagate back in time and influence the actions of policies in earlier times. This is summarized in Figure 3.

[[File:modelFree_blockDiagram.PNG|400px|center]]

Figure 2: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.

[[File:modelBased_blockDiagram.PNG|700px|center]]

Figure 3: Block diagram of model-based adversarial imitation learning.

Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action at is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of a.) the error message <math> \delta_a </math> (Green) that propagates fluently through the differentiable approximation of the sampling process. And b.) the error message <math> \delta_s </math> (Blue) of future time-steps, that propagate back through the differentiable forward model.

== MGAIL Algorithm ==
Shalev- Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:

\begin{align}
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]
\end{align}

Using the results from Heess et al. (2015) this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s’) </math> transitions:

\begin{align}
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]
\end{align}

The policy gradient <math> \nabla_\theta J </math> is calculated by applying equations 12 and 13 recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.

[[File:MGAIL_alg.PNG]]

== Forward Model Structure ==
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled form different distributions need to be first represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent units (GRU, a simpler variant on the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better, and more stable results compared to the standard forward model based on a feed forward neural network. The comparison is presented in Figure 4.

[[File:performance_comparison.PNG]]

Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).

= Experiments =
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot) and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid). These tasks are modelled by the MuJoCo physics simulator (Todorov et al., 2012), contain second order dynamics and utilize direct torque control. Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). Different number of trajectories are used to train the expert for each task, but all trajectories are of length 1000.
The discriminator and generator (policy) networks contains two hidden layers with ReLU non-linearities and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them. A comparison between the basic forward model and the more advanced forward model is also made and described in the previous section of this summary. The two models compared are shown below.

[[File:baram17_forward.PNG]]

[[File:mgail_test_results_1.PNG]]

[[File:mgail_test_results.PNG]]

Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.

= Discussion =
This paper presented a model-free algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model; this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered as one line of future work. Another future work is to address the violation of the fundamental assumption made by all supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution. The authors tried a solution proposed by another paper (Loshchilov & Hutter, 2016), which is to reset the learning rate several times during training period, but it did not result in significant improvements.

= Source =
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

End-to-End Differentiable Adversarial Imitation Learning

2018-03-22T19:24:36Z

Apon: /* Experiments */

= Introduction =
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.

To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminators decision boundary hyper-plane, as seen in Figure 1. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model free setup is the one where the agent cannot make predictions about what the next state and reward will be before it takes each action since the transition function to move from state A to state B is not learned.

The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. Training policy <math>\pi</math> assumes the existence of an expert policy <math>\pi_{E}</math> with given trajectories <math>\{s_{0},a_{0},s_{1},...\}^{N}_{i=0}</math> which it aims to imitate without access to the original reward signal <math>r_{e}</math>. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.

[[File:GeneratorFollowingDiscriminator.png|center]]

Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.

= Background =
== Markov Decision Process ==
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P :
S × A × S → [0, 1]</math> is the transition probability distribution, <math>r : (S × A) → R</math> is the reward function, <math>\rho_0 : S → [0, 1]</math> is the distribution over initial states, and <math>γ ∈ (0, 1)</math> is the discount factor. Let <math>π</math> denote a stochastic policy <math>π : S × A → [0, 1]</math>, <math>R(π)</math> denote its expected discounted reward: <math>E_πR = E_π [\sum_{t=0}^T \gamma^t r_t]</math> and <math>τ</math> denote a trajectory of states and actions <math>τ = {s_0, a_0, s_1, a_1, ...}</math>.

== Imitation Learning ==
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made by most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy.

This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued till the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following expert's policy at the start of training.

== Generative Adversarial Networks ==
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:

\begin{align}
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{x\sim p_E}[log(D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]
\end{align}

In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.

GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:

\begin{align}
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}} = \mathbb{E}_{\pi}[log(D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi))
\end{align}

where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.

This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log(D(s,a)] </math>.

The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]
\end{align}

where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:

\begin{align}
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]
\end{align}

REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.

= Algorithm =
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.

== The discriminator network ==
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in (\pi_E, \pi) </math>.

The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:

\begin{aligned}
D(s,a) &= p(\pi|s,a) \\
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\
\end{aligned}

Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:

\begin{aligned}
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}
\end{aligned}

to get the final expression for <math> D(s,a) </math>:
\begin{aligned}
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}
\end{aligned}

<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induces by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action a under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can by obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed to be used for training policies (see following sections):

\begin{aligned}
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\
\end{aligned}

== Backpropagating through stochastic units ==
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.

=== Continuous Action Distributions ===
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and variance are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, then the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:

\begin{align}
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(a,s) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}
\end{align}

=== Categorical Action Distributions ===
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:

\begin{align}
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).
\end{align}

Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick (Gumbel-softmax allows us to generate a differentiable sample from a discrete distribution, which is needed in this trajectory imitation setting.):

\begin{align}
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{k}exp[\frac{1}{\tau}(g_j + log\ \pi(a_i|s))]}
\end{align}

In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.

The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.

== Backpropagating through a Forward model ==
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:

\begin{align}
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )
\end{align}

Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagate back in time and influence the actions of policies in earlier times. This is summarized in Figure 3.

[[File:modelFree_blockDiagram.PNG|400px|center]]

Figure 2: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.

[[File:modelBased_blockDiagram.PNG|700px|center]]

Figure 3: Block diagram of model-based adversarial imitation learning.

Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action at is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of a.) the error message <math> \delta_a </math> (Green) that propagates fluently through the differentiable approximation of the sampling process. And b.) the error message <math> \delta_s </math> (Blue) of future time-steps, that propagate back through the differentiable forward model.

== MGAIL Algorithm ==
Shalev- Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:

\begin{align}
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]
\end{align}

Using the results from Heess et al. (2015) this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s’) </math> transitions:

\begin{align}
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]
\end{align}

The policy gradient <math> \nabla_\theta J </math> is calculated by applying equations 12 and 13 recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.

[[File:MGAIL_alg.PNG]]

== Forward Model Structure ==
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled form different distributions need to be first represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent units (GRU, a simpler variant on the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better, and more stable results compared to the standard forward model based on a feed forward neural network. The comparison is presented in Figure 4.

[[File:performance_comparison.PNG]]

Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).

= Experiments =
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot) and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid). These tasks are modelled by the MuJoCo physics simulator (Todorov et al., 2012), contain second order dynamics and utilize direct torque control. Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). Different number of trajectories are used to train the expert for each task, but all trajectories are of length 1000.
The discriminator and generator (policy) networks contains two hidden layers with ReLU non-linearities and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them. A comparison between the basic forward model and the more advanced forward model is also made and described in the previous section of this summary. The two models compared are shown below.

[[File:baram17_forward.PNG]]

[[File:mgail_test_results_1.PNG]]

[[File:mgail_test_results.PNG]]

Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.

= Discussion =
This paper presented a model-free algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered as one line of future work. Another future work is to address the violation of the fundamental assumption made by all supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution. The authors tried a solution proposed by another paper, which is to reset the learning rate several times during training period, but it did not result in significant improvements.

= Source =
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

Do Deep Neural Networks Suffer from Crowding

2018-03-22T19:20:20Z

Apon: /* Critique */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.

= Models =
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below:

1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

===What is the problem in CNNs?===
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments=
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image that the target. (ax)
# Two flankers spaced equally around the target, being both the same object (xax).

Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations
* If the target-flanker spacing is changed, then models perform worse
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.
[[File:result2.png|750x400px|center]]
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target
[[File:paper25_supplemental1.png|800px|center]]

===DCNN Observations===
* The recognition gets worse with the increase in the number of flankers.
* Convolutional networks are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.
* Spatial pooling helps in learning invariance.
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier.
[[File:result3.png|750x400px|center]]

====Observations====
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.

=Conclusions=
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.

=Critique=
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.

This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network.

=References=
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Do Deep Neural Networks Suffer from Crowding

2018-03-22T19:13:13Z

Apon: /* Experiments and its Set-Up */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.

= Models =
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below:

1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

===What is the problem in CNNs?===
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments=
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image that the target. (ax)
# Two flankers spaced equally around the target, being both the same object (xax).

Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations
* If the target-flanker spacing is changed, then models perform worse
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.
[[File:result2.png|750x400px|center]]
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target
[[File:paper25_supplemental1.png|800px|center]]

===DCNN Observations===
* The recognition gets worse with the increase in the number of flankers.
* Convolutional networks are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.
* Spatial pooling helps in learning invariance.
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier.
[[File:result3.png|750x400px|center]]

====Observations====
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.

=Conclusions=
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.

=Critique=
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.

=References=
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Do Deep Neural Networks Suffer from Crowding

2018-03-22T19:12:57Z

Apon: /* Introduction */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.

= Models =
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below:

1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

===What is the problem in CNNs?===
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments and its Set-Up =
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image that the target. (ax)
# Two flankers spaced equally around the target, being both the same object (xax).

Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations
* If the target-flanker spacing is changed, then models perform worse
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.
[[File:result2.png|750x400px|center]]
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target
[[File:paper25_supplemental1.png|800px|center]]

===DCNN Observations===
* The recognition gets worse with the increase in the number of flankers.
* Convolutional networks are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.
* Spatial pooling helps in learning invariance.
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier.
[[File:result3.png|750x400px|center]]

====Observations====
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.

=Conclusions=
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.

=Critique=
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.

=References=
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Do Deep Neural Networks Suffer from Crowding

2018-03-22T19:11:47Z

Apon: /* Introduction */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been lots of research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks(DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and will be explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular.

= Models =
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Based on several hypothesis that pooling is the cause of crowding in human perception, the paper tries to investigate the effects of pooling on the detection of crowded images through these two network types.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below:

1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

===What is the problem in CNNs?===
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus Even more importantly, CNNs rely on data augmentation to achieve transformation-invariance and obviously a lot of processing is needed for CNNs.

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing a number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments and its Set-Up =
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image that the target. (ax)
# Two flankers spaced equally around the target, being both the same object (xax).

Training is done using backpropogation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations
* If the target-flanker spacing is changed, then models perform worse
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.
[[File:result2.png|750x400px|center]]
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target
[[File:paper25_supplemental1.png|800px|center]]

===DCNN Observations===
* The recognition gets worse with the increase in the number of flankers.
* Convolutional networks are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.
* Spatial pooling helps in learning invariance.
*Flankers similar to the target object helps in recognition since they don't activate the convolutional filter more.
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier.
[[File:result3.png|750x400px|center]]

====Observations====
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
- Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
- The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.

=Conclusions=
We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects.
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.

=Critique=
This paper just tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such a type of crowding. The eccentricity based model does well only when the target is placed at the center of the image but maybe windowing over the frames instead of taking crops starting from the middle might help.

=References=
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Spherical CNNs

2018-03-22T19:02:04Z

Apon: /* Molecular Atomization */

= Introduction =
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.

[[File:paper26-fig1.png|center]]

The main contributions of this paper are the following:
# The theory of spherical CNNs.
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems.

= Notation =
Below are listed several important terms:
* '''Unit Sphere''' <math>S^2</math> is defined as a sphere where all of its points are distance of 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha ∈ [0, 2π]</math> and <math>β ∈ [0, π]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>β</math>.
* '''<math>S^2</math> Sphere''' The three dimensional surface from a 3D sphere
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : s^2 → \mathbb{R}^K</math>. K is the number of channels. Such as how RGB images have 3 channels a spherical signal can have numerous channels describing the data. Examples of channels which were used can be found in the experiments section.
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]

= Related Work =
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for speed effective performance of group correlation.

= Correlations on the Sphere and Rotation Group =
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:

'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.

'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.

'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>. The rotation operator simply rotates a function (which allows us to rotate the the spherical filters) by <math>R^{-1}</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.

'''Inner Products''' The inner product of spherical signals is simply the integral summation on the vector space over the entire sphere.

<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math>

<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler paramaterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.

By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.

<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math>

::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math>

::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math>

'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:

<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math>

The output of the above equation is a function on SO(3). This can be thought of as for each rotation combination of <math>\alpha , \beta , \gamma </math> there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healey only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.

'''Rotation of SO(3) Signals''' The first layer of Spherical CNNs take a function on the sphere (<math>S^2</math>) and output a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to encompass acting on signals from SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)

<math>[L_Rf](Q)=f(R^{-1} Q)</math>

'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:

<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math>

where dQ represents the ZYZ-Euler angles <math>d \alpha sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.

'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and:

<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.

= Implementation with GFFT =
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that a continuous periodic function can be expressed as a sum of a series of sine or cosine terms (called Fourier coefficients). The FFT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \geq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:

<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math>

where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.

The inverse SO(3) Fourier transform is:

<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math>

The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.

The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).

[[File:paper26-fig2.png|center]]

The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they just mentioned that FFT can be computed in <math>O(n\mathrm{log}n)</math> time) or any comparisons on training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.

= Experiments =
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.

==Equivariance Error==
In this experiment the authors try to show experimentally that their theory of equivariance holds. They express that they had doubts about the equivariance in practice due to potential discretization artifacts since equivariance was proven for the continuous case, with the potential consequence of equivariance not holding being that the weight sharing scheme becomes less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i))/std(\Phi(f_i))</math>
Note: The authors do not mention what the std function is however it is likely the standard deviation function as 'std' is the command for standard deviation in MATLAB.
<math>\Phi</math> is a composition of SO(3) correlation layers with filters which have been randomly initialized. The authors mention that they were expecting <math>\Delta</math> to be zero in the case of perfect equivariance. This is due to, as proven earlier, the following two terms equaling each other in the continuous case: <math>\small L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i)</math>. The results are shown in Figure 3.

[[File:paper26-fig3.png|center]]

<math>\Delta</math> only grows with resolution/layers when there is no activation function. With ReLU activation the error stays constant once slightly higher than 0 resolution. The authors indicate that the error must therefore be from the feature map rotation since this type of error is exact only for bandlimited functions.

==MNIST Data==
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.

[[File:paper26-fig4.png|center]]

The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.

[[File:paper26-tab1.png|center]]

==SHREC17==
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.

[[File:paper26-fig5.png|center]]

The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.

This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.

[[File:paper26-tab2.png|center]]

==Molecular Atomization==
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. This is done as follows. First, for each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. The radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.

[[File:paper26-tab3.png|center]]

[[File:paper26-f6.png|center]]

= Conclusions =
This paper presents a novel architecture called Spherical CNNs. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational invariance for continuous functions, and demonstrates that the invariance also applies to the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications to Spherical CNNs.

For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.

= Commentary =
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section however I found that it is still limited which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.

= Source Code =
Source code is available at:
https://github.com/jonas-koehler/s2cnn

= Sources =
* T. Cohen et al. Spherical CNNs, 2018.
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.

Spherical CNNs

2018-03-22T18:58:25Z

Apon: /* Introduction */

= Introduction =
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.

[[File:paper26-fig1.png|center]]

The main contributions of this paper are the following:
# The theory of spherical CNNs.
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems.

= Notation =
Below are listed several important terms:
* '''Unit Sphere''' <math>S^2</math> is defined as a sphere where all of its points are distance of 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha ∈ [0, 2π]</math> and <math>β ∈ [0, π]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>β</math>.
* '''<math>S^2</math> Sphere''' The three dimensional surface from a 3D sphere
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : s^2 → \mathbb{R}^K</math>. K is the number of channels. Such as how RGB images have 3 channels a spherical signal can have numerous channels describing the data. Examples of channels which were used can be found in the experiments section.
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]

= Related Work =
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for speed effective performance of group correlation.

= Correlations on the Sphere and Rotation Group =
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:

'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.

'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.

'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>. The rotation operator simply rotates a function (which allows us to rotate the the spherical filters) by <math>R^{-1}</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.

'''Inner Products''' The inner product of spherical signals is simply the integral summation on the vector space over the entire sphere.

<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math>

<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler paramaterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.

By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.

<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math>

::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math>

::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math>

'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:

<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math>

The output of the above equation is a function on SO(3). This can be thought of as for each rotation combination of <math>\alpha , \beta , \gamma </math> there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healey only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.

'''Rotation of SO(3) Signals''' The first layer of Spherical CNNs take a function on the sphere (<math>S^2</math>) and output a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to encompass acting on signals from SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)

<math>[L_Rf](Q)=f(R^{-1} Q)</math>

'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:

<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math>

where dQ represents the ZYZ-Euler angles <math>d \alpha sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.

'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and:

<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.

= Implementation with GFFT =
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that a continuous periodic function can be expressed as a sum of a series of sine or cosine terms (called Fourier coefficients). The FFT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \geq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:

<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math>

where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.

The inverse SO(3) Fourier transform is:

<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math>

The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.

The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).

[[File:paper26-fig2.png|center]]

The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they just mentioned that FFT can be computed in <math>O(n\mathrm{log}n)</math> time) or any comparisons on training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.

= Experiments =
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.

==Equivariance Error==
In this experiment the authors try to show experimentally that their theory of equivariance holds. They express that they had doubts about the equivariance in practice due to potential discretization artifacts since equivariance was proven for the continuous case, with the potential consequence of equivariance not holding being that the weight sharing scheme becomes less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i))/std(\Phi(f_i))</math>
Note: The authors do not mention what the std function is however it is likely the standard deviation function as 'std' is the command for standard deviation in MATLAB.
<math>\Phi</math> is a composition of SO(3) correlation layers with filters which have been randomly initialized. The authors mention that they were expecting <math>\Delta</math> to be zero in the case of perfect equivariance. This is due to, as proven earlier, the following two terms equaling each other in the continuous case: <math>\small L_{R_i} \Phi(f_i) - \phi(L_{R_i} f_i)</math>. The results are shown in Figure 3.

[[File:paper26-fig3.png|center]]

<math>\Delta</math> only grows with resolution/layers when there is no activation function. With ReLU activation the error stays constant once slightly higher than 0 resolution. The authors indicate that the error must therefore be from the feature map rotation since this type of error is exact only for bandlimited functions.

==MNIST Data==
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.

[[File:paper26-fig4.png|center]]

The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.

[[File:paper26-tab1.png|center]]

==SHREC17==
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.

[[File:paper26-fig5.png|center]]

The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.

This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.

[[File:paper26-tab2.png|center]]

==Molecular Atomization==
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset which has the task of predicting atomization energy of molecules. The positions and charges given in the dataset are projected onto the sphere using potential functions. This is done as follows. First, for each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. The radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.

[[File:paper26-tab3.png|center]]

[[File:paper26-f6.png|center]]

= Conclusions =
This paper presents a novel architecture called Spherical CNNs. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational invariance for continuous functions, and demonstrates that the invariance also applies to the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications to Spherical CNNs.

For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.

= Commentary =
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section however I found that it is still limited which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.

= Source Code =
Source code is available at:
https://github.com/jonas-koehler/s2cnn

= Sources =
* T. Cohen et al. Spherical CNNs, 2018.
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.

stat946w18/Implicit Causal Models for Genome-wide Association Studies

2018-03-22T18:46:59Z

Apon: /* Real-data Analysis */

==Introduction and Motivation==
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results.

Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease.

The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.

[[File:gwas-example.jpg|500px|center]]

This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.

The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.

For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).

There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.

==Implicit Causal Models==
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

=== Probabilistic Causal Models ===
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where

[[File: eq1.1.png|800px|center]]

Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，

[[File: eqt1.png|800px|center]]

The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.

[[File: f_1.png|650px|center|]]

An example of probabilistic causal models is additive noise model.

[[File: eq2.1.png|800px|center]]

<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as

[[File: eqt2.png|800px|center]]

where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.

===Implicit Causal Models===
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of an additive noise term, implicit causal models directly take noise <math>\epsilon</math> into a neural network and output <math>x</math>.

The causal diagram has changed to:

[[File: f_2.png|650px|center|]]

They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description:

[[File: theorem.png|650px|center|]]

==Implicit Causal Models with Latent Confounders==
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.

===Causal Inference with a Latent Confounder===
Same as before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.

The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,

[[File: eqt4.png|800px|center]]

The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well.

The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.

[[File: eqt5.png|800px|center]]

Note that the latent structure <math>p(z|x, y)</math> is assumed known.

In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:

'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.

Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.

===Implicit Causal Model with a Latent Confounder===
This section is the algorithm and functions to implementing an implicit causal model for GWAS.

====Generative Process of Confounders <math>z_n</math>.====
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural.

====Generative Process of SNPs <math>x_{nm}</math>.====
Given SNP is coded for,

[[File: SNP.png|300px|center]]

The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.

[[File: gpx.png|800px|center]]

A SNP matrix looks like this:
[[File: SNP_matrix.png|200px|center]]

Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,

[[File: gpxnn.png|800px|center]]

This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.

====Generative Process of Traits <math>y_n</math>.====
Previously, each trait is modeled by a linear regression,

[[File: gpy.png|800px|center]]

This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,

[[File: gpynn.png|800px|center]]

==Likelihood-free Variational Inference==
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.

[[File: eqt5.png|800px|center]]

could be reduces to

[[File: lfvi1.png|800px|center]]

However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,

[[File: lfvi2.png|700px|center]]

For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:
[[File: em.png|800px|center]]

==Empirical Study==
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings.
Four methods were compared:

* implicit causal model (ICM);
* PCA with linear regression (PCA);
* a linear mixed model (LMM);
* logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization.

===Simulation Study===
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration.
There are four datasets used in this simulation study:

# HapMap [Balding-Nichols model]
# 1000 Genomes Project (TGP) [PCA]
#* Human Genome Diversity project (HGDP) [PCA]
#* HGDP [Pritchard-Stephens-Donelly model]
# A latent spatial position of individuals for population structure [spatial]

The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.

[[File: table_1.png|650px|center|]]

The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.

===Real-data Analysis===
They also applied ICM to a real-world GWAS of Northern Finland Birth Cohorts, which measure metabolic traits and height and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted separately and the 2 neural networks both with two hidden layers were used for SNP and trait. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. The dimension of confounders (<math>K</math>) was set to be six, same as what was used in the paper by Song et al. for comparable models in Table 2.

[[File: table_2.png|650px|center|]]

The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.

==Conclusion==
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.

By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.

The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.

==Critique==
I think this paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feedforward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS.

It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.

Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.

Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.
This could be a future work as well.

==References==
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.

Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.

Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.

stat946w18/Implicit Causal Models for Genome-wide Association Studies

2018-03-22T18:46:37Z

Apon: /* Real-data Analysis */

==Introduction and Motivation==
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results.

Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease.

The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.

[[File:gwas-example.jpg|500px|center]]

This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.

The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.

For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).

There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.

==Implicit Causal Models==
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

=== Probabilistic Causal Models ===
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where

[[File: eq1.1.png|800px|center]]

Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，

[[File: eqt1.png|800px|center]]

The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.

[[File: f_1.png|650px|center|]]

An example of probabilistic causal models is additive noise model.

[[File: eq2.1.png|800px|center]]

<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as

[[File: eqt2.png|800px|center]]

where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.

===Implicit Causal Models===
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of an additive noise term, implicit causal models directly take noise <math>\epsilon</math> into a neural network and output <math>x</math>.

The causal diagram has changed to:

[[File: f_2.png|650px|center|]]

They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description:

[[File: theorem.png|650px|center|]]

==Implicit Causal Models with Latent Confounders==
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.

===Causal Inference with a Latent Confounder===
Same as before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.

The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,

[[File: eqt4.png|800px|center]]

The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well.

The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.

[[File: eqt5.png|800px|center]]

Note that the latent structure <math>p(z|x, y)</math> is assumed known.

In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:

'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.

Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.

===Implicit Causal Model with a Latent Confounder===
This section is the algorithm and functions to implementing an implicit causal model for GWAS.

====Generative Process of Confounders <math>z_n</math>.====
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural.

====Generative Process of SNPs <math>x_{nm}</math>.====
Given SNP is coded for,

[[File: SNP.png|300px|center]]

The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.

[[File: gpx.png|800px|center]]

A SNP matrix looks like this:
[[File: SNP_matrix.png|200px|center]]

Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,

[[File: gpxnn.png|800px|center]]

This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.

====Generative Process of Traits <math>y_n</math>.====
Previously, each trait is modeled by a linear regression,

[[File: gpy.png|800px|center]]

This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,

[[File: gpynn.png|800px|center]]

==Likelihood-free Variational Inference==
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.

[[File: eqt5.png|800px|center]]

could be reduces to

[[File: lfvi1.png|800px|center]]

However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,

[[File: lfvi2.png|700px|center]]

For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:
[[File: em.png|800px|center]]

==Empirical Study==
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings.
Four methods were compared:

* implicit causal model (ICM);
* PCA with linear regression (PCA);
* a linear mixed model (LMM);
* logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization.

===Simulation Study===
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration.
There are four datasets used in this simulation study:

# HapMap [Balding-Nichols model]
# 1000 Genomes Project (TGP) [PCA]
#* Human Genome Diversity project (HGDP) [PCA]
#* HGDP [Pritchard-Stephens-Donelly model]
# A latent spatial position of individuals for population structure [spatial]

The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.

[[File: table_1.png|650px|center|]]

The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.

===Real-data Analysis===
They also applied ICM to a real-world GWAS of Northern Finland Birth Cohorts, which measure metabolic traits and height and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted and the 2 neural networks both with two hidden layers were used for SNP and trait. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. The dimension of confounders (<math>K</math>) was set to be six, same as what was used in the paper by Song et al. for comparable models in Table 2.

[[File: table_2.png|650px|center|]]

The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.

==Conclusion==
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.

By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.

The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.

==Critique==
I think this paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feedforward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS.

It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.

Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.

Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.
This could be a future work as well.

==References==
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.

Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.

Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.

stat946w18/Implicit Causal Models for Genome-wide Association Studies

2018-03-22T18:43:04Z

Apon: /* Introduction and Motivation */

==Introduction and Motivation==
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results.

Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease.

The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.

[[File:gwas-example.jpg|500px|center]]

This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.

The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.

For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).

There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.

==Implicit Causal Models==
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

=== Probabilistic Causal Models ===
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where

[[File: eq1.1.png|800px|center]]

Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，

[[File: eqt1.png|800px|center]]

The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.

[[File: f_1.png|650px|center|]]

An example of probabilistic causal models is additive noise model.

[[File: eq2.1.png|800px|center]]

<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as

[[File: eqt2.png|800px|center]]

where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.

===Implicit Causal Models===
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of an additive noise term, implicit causal models directly take noise <math>\epsilon</math> into a neural network and output <math>x</math>.

The causal diagram has changed to:

[[File: f_2.png|650px|center|]]

They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description:

[[File: theorem.png|650px|center|]]

==Implicit Causal Models with Latent Confounders==
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.

===Causal Inference with a Latent Confounder===
Same as before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.

The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,

[[File: eqt4.png|800px|center]]

The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well.

The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.

[[File: eqt5.png|800px|center]]

Note that the latent structure <math>p(z|x, y)</math> is assumed known.

In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:

'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.

Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.

===Implicit Causal Model with a Latent Confounder===
This section is the algorithm and functions to implementing an implicit causal model for GWAS.

====Generative Process of Confounders <math>z_n</math>.====
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural.

====Generative Process of SNPs <math>x_{nm}</math>.====
Given SNP is coded for,

[[File: SNP.png|300px|center]]

The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.

[[File: gpx.png|800px|center]]

A SNP matrix looks like this:
[[File: SNP_matrix.png|200px|center]]

Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,

[[File: gpxnn.png|800px|center]]

This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.

====Generative Process of Traits <math>y_n</math>.====
Previously, each trait is modeled by a linear regression,

[[File: gpy.png|800px|center]]

This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,

[[File: gpynn.png|800px|center]]

==Likelihood-free Variational Inference==
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.

[[File: eqt5.png|800px|center]]

could be reduces to

[[File: lfvi1.png|800px|center]]

However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,

[[File: lfvi2.png|700px|center]]

For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:
[[File: em.png|800px|center]]

==Empirical Study==
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings.
Four methods were compared:

* implicit causal model (ICM);
* PCA with linear regression (PCA);
* a linear mixed model (LMM);
* logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization.

===Simulation Study===
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration.
There are four datasets used in this simulation study:

# HapMap [Balding-Nichols model]
# 1000 Genomes Project (TGP) [PCA]
#* Human Genome Diversity project (HGDP) [PCA]
#* HGDP [Pritchard-Stephens-Donelly model]
# A latent spatial position of individuals for population structure [spatial]

The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.

[[File: table_1.png|650px|center|]]

The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.

===Real-data Analysis===
They also applied ICM to a real-world GWAS of Northern Finland Birth Cohorts which contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted and the 2 neural networks both with two hidden layers were used for SNP and trait. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. The dimension of confounders (<math>K</math>) was set to be six, same as what was used in the paper by Song et al. for comparable models in Table 2.

[[File: table_2.png|650px|center|]]

The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.

==Conclusion==
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.

By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.

The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.

==Critique==
I think this paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feedforward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS.

It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.

Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.

Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.
This could be a future work as well.

==References==
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.

Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.

Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.

MarrNet: 3D Shape Reconstruction via 2.5D Sketches

2018-03-22T18:25:22Z

Apon: /* PASCAL 3D+ */

= Introduction =
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.

[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]

In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the author’s design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape.

The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.

Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.

The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.

= Related Work =

== 2.5D Sketch Recovery ==
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.

== Single Image 3D Reconstruction ==
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.

== 2D-3D Consistency ==
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.

= Approach =
The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.

[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]

== 2.5D Sketch Estimation ==
The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.

== 3D Shape Estimation ==
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.

== Re-projection Consistency ==
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.

[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]

=== Depths ===
The voxel with depth <math>v_{x, y}, d_{x, y}</math> should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:

<math>
L_{depth}(x, y, z)=
\left\{
\begin{array}{ll}
v^2_{x, y, z}, & z < d_{x, y} \\
(1 - v_{x, y, z})^2, & z = d_{x, y} \\
0, & z > d_{x, y} \\
\end{array}
\right.
</math>

<math>
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =
\left\{
\begin{array}{ll}
2v{x, y, z}, & z < d_{x, y} \\
2(v_{x, y, z} - 1), & z = d_{x, y} \\
0, & z > d_{x, y} \\
\end{array}
\right.
</math>

When <math>d_{x, y} = \infty</math>, all voxels in front of it should be 0.

=== Surface Normals ===
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal tried to guarantee voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.

The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:

<math>
L_{normal}(x, y, z) =
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 +
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2
</math>

Gradients along x are:

<math>
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)
</math>
and
<math>
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)
</math>

Gradients along y are similar to x.

= Training =
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.

For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.

Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.

Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.

= Evaluation =
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets.

== ShapeNet ==
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random background from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.

=== Method ===
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.

=== Results ===
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. Quantitatively, the full model also achieves 0.57 IoU, higher than the direct prediction baseline.

[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]

== PASCAL 3D+ ==
Rough 3D models are provided from real-life images.

=== Method ===
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.

=== Results ===
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.

[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]

Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. However, the authors claim that the IoU metric is sub-optimal for three reasons. First, there is no emphasis on details since the metric prefers models that predict mean shapes consistently. Second, all possible scales are searched during the IoU computation, making it less efficient. Third, PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, and computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.

[[File:human_studies.png|600px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans prefered the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]

[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]

Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.

[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]

===

== IKEA ==
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.

=== Results ===
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studes show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.

[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]

== Other Data ==
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.

[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]

MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.

[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]

= Commentary =
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.

As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.

As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder.

= Conclusion =
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.

= References =
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
# JiajunWu, Chengkai Zhang, Tianfan Xue,William T Freeman, and Joshua B Tenenbaum. Learning a Proba- bilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.

File:human studies.png

2018-03-22T18:21:39Z

Apon:

stat946w18

2018-03-21T21:34:49Z

Apon: /* Paper presentation */

=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=

= Record your contributions here [https://docs.google.com/spreadsheets/d/1fU746Cld_mSqQBCD5qadvkXZW1g-j-kHvmHQ6AMeuqU/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

[https://docs.google.com/forms/d/e/1FAIpQLSdcfYZu5cvpsbzf0Nlxh9TFk8k1m5vUgU1vCLHQNmJog4xSHw/viewform?usp=sf_link Your feedback on presentations]

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]
|-
|Feb 27 || || 1|| || ||
|-
|Feb 27 || || 2|| || ||
|-
|Feb 27 || || 3|| || ||
|-
|Mar 1 || Peter Forsyth || 4|| Unsupervised Machine Translation Using Monolingual Corpora Only || [https://arxiv.org/pdf/1711.00043.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]
|-
|Mar 1 || wenqing liu || 5|| Spectral Normalization for Generative Adversarial Networks || [https://openreview.net/pdf?id=B1QRgziT- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network Summary]
|-
|Mar 1 || Ilia Sucholutsky || 6|| One-Shot Imitation Learning || [https://papers.nips.cc/paper/6709-one-shot-imitation-learning.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning Summary]
|-
|Mar 6 || George (Shiyang) Wen || 7|| AmbientGAN: Generative models from lossy measurements || [https://openreview.net/pdf?id=Hy7fDog0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements Summary]
|-
|Mar 6 || Raphael Tang || 8|| Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers || [https://arxiv.org/pdf/1802.00124.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers Summary]
|-
|Mar 6 ||Fan Xia || 9|| Word translation without parallel data ||[https://arxiv.org/pdf/1710.04087.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data Summary]
|-
|Mar 8 || Alex (Xian) Wang || 10 || Self-Normalizing Neural Networks || [http://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks Summary]
|-
|Mar 8 || Michael Broughton || 11|| Convergence of Adam and beyond || [https://openreview.net/pdf?id=ryQu7f-RZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond Summary]
|-
|Mar 8 || Wei Tao Chen || 12|| Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/forum?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data Summary]
|-
|Mar 13 || Chunshang Li || 13 || UNDERSTANDING IMAGE MOTION WITH GROUP REPRESENTATIONS || [https://openreview.net/pdf?id=SJLlmG-AZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations Summary]
|-
|Mar 13 || Saifuddin Hitawala || 14 || Robust Imitation of Diverse Behaviors || [https://papers.nips.cc/paper/7116-robust-imitation-of-diverse-behaviors.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors Summary]
|-
|Mar 13 || Taylor Denouden || 15|| A neural representation of sketch drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings Summary]
|-
|Mar 15 || Zehao Xu || 16|| Synthetic and natural noise both break neural machine translation || [https://openreview.net/pdf?id=BJ8vJebC- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation Summary]
|-
|Mar 15 || Prarthana Bhattacharyya || 17|| Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders Summary]
|-
|Mar 15 || Changjian Li || 18|| Label-Free Supervision of Neural Networks with Physics and Domain Knowledge || [https://arxiv.org/pdf/1609.05566.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge Summary]
|-
|Mar 20 || Travis Dunn || 19|| Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments || [https://openreview.net/pdf?id=Sk2u1g-0- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments Summary]
|-
|Mar 20 || Sushrut Bhalla || 20|| MaskRNN: Instance Level Video Object Segmentation || [https://papers.nips.cc/paper/6636-maskrnn-instance-level-video-object-segmentation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation Summary]
|-
|Mar 20 || Hamid Tahir || 21|| Wavelet Pooling for Convolution Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN Summary]
|-
|Mar 22 || Dongyang Yang|| 22|| Implicit Causal Models for Genome-wide Association Studies || [https://openreview.net/pdf?id=SyELrEeAb Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies Summary]
|-
|Mar 22 || Yao Li || 23||Improving GANs Using Optimal Transport || [https://openreview.net/pdf?id=rkQkBnJAb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT Summary]
|-
|Mar 22 || Sahil Pereira || 24||End-to-End Differentiable Adversarial Imitation Learning|| [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning Summary]
|-
|Mar 27 || Jaspreet Singh Sambee || 25|| Do Deep Neural Networks Suffer from Crowding? || [http://papers.nips.cc/paper/7146-do-deep-neural-networks-suffer-from-crowding.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding Summary]
|-
|Mar 27 || Braden Hurl || 26|| Spherical CNNs || [https://openreview.net/pdf?id=Hkbd5xZRb Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs Summary]
|-
|Mar 27 || Marko Ilievski || 27|| Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders || [http://proceedings.mlr.press/v70/engel17a/engel17a.pdf Paper] ||
|-
|Mar 29 || Alex Pon || 28||PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space || [https://arxiv.org/abs/1706.02413 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space Summary]
|-
|Mar 29 || Sean Walsh || 29||Multi-scale Dense Networks for Resource Efficient Image Classification || [https://arxiv.org/pdf/1703.09844.pdf Paper] ||
|-
|Mar 29 || Jason Ku || 30||MarrNet: 3D Shape Reconstruction via 2.5D Sketches ||[https://arxiv.org/pdf/1711.03129.pdf Paper] ||
|-
|Apr 3 || Tong Yang || 31|| Dynamic Routing Between Capsules. || [http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf Paper] ||
|-
|Apr 3 || Benjamin Skikos || 32|| Training and Inference with Integers in Deep Neural Networks || [https://openreview.net/pdf?id=HJGXzmspb Paper] ||
|-
|Apr 3 || Weishi Chen || 33|| Tensorized LSTMs for Sequence Learning || [https://arxiv.org/pdf/1711.01577.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&action=edit&redlink=1 Summary] ||
|-

Wavelet Pooling CNN

2018-03-20T18:56:46Z

Apon: /* Introduction */

== Introduction ==
Convolutional neural networks (CNN) have been proven to be powerful in image classification. Over the past few years researchers have put efforts in improving fundamental components of CNNs such as the pooling operation. Various pooling methods exist; deterministic methods include max pooling and average pooling and probabilistic methods include mixed pooling and stochastic pooling. All these methods employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).

This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation, uses a sub-band method that the authors claim produces less artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.

== Pooling Background ==
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data is reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. Max pooling and Mean/Average pooling are the 2 most commonly used pooling methods. For max pooling, this can be represented by the equation <math>a_{kij} = max_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^th</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.

[[File:WT_Fig1.PNG|650px|center|]]

The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.

[[File:WT_Fig2.PNG|650px|center|]]

To account for the above mentioned issues, probabilistic pooling methods were introduced, namely mixed pooling and; stochastic pooling. Mixed pooling is a simple method which just combines the max and the average pooling by randomly selecting one method over the other during training. Stochastic pooling on the other hand randomly samples within a receptive field with the activation values as the probabilities. These are calculated by taking each activation value and dividing it by the sum of all activation values in the grid so that the probabilities sum to 1.

Figure 3 shows an example of how stochastic pooling works. On the left is a 3x3 grid filled with activations. The middle grid is the corresponding probability for each activation. The activation in the middle was randomly selected (it had a 13% chance of getting selected). Because the stochastic pooling is based on the probability of the pixels, it is able to avoid the shortcomings of max and mean pooling mentioned above.

[[File:paper21-stochasticpooling.png|650px|center|]]

== Wavelet Background ==
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well.

Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting.

[[File:WT_Fig3.jpg|650px|center|]]

The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.

== Discrete Wavelet Transform General==
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.

[[File:WT_Fig8.png|650px|center|]]

[[File:WT_Fig9.png|650px|center|]]

== DWT example using Haar Wavelet ==
Suppose we have an image represented by the following pixels:
<math> \begin{bmatrix}
100 & 50 & 60 & 150 \\
20 & 60 & 40 & 30 \\
50 & 90 & 70 & 82 \\
74 & 66 & 90 & 58 \\
\end{bmatrix} </math>

For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row
* a1 = (i1 + i2)/2
* a2 = (i3 + i4)/2
* d1 = (i1 - i2)/2
* d2 = (i3 - i4)/2

After the row transforms, the images looks as follows:
<math> \begin{bmatrix}
75 & 105 & 25 & -45 \\
40 & 35 & -20 & 5 \\
70 & 76 & -20 & -6 \\
70 & 74 & 4 & 16 \\
\end{bmatrix} </math>

Now we apply the same method to the columns in the exact same way.

== Proposed Method ==
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create less artifacts that may affect the image classification.
=== Forward Propagation ===
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k <= 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\psi[j,n]|_{n = 2k, k <= 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math>, <math>W_\psi</math>, are approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are time reversed scaling and wavelet vectors, <math>(n)</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply to images, FWT is first applied on the rows and then the columns. If a low (L) and high(H) sub-band is extracted from the rows and similarly for the columns than at each level there is 4 sub-bands (LH, HL, HH, and LL) where LL will further be decomposed into the level 2 decomposition.

Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k <= 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.

[[File:WT_Fig6.PNG|650px|center|]]

=== Back Propagation ===
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.

[[File:WT_Fig7.PNG|650px|center|]]

== Results ==
The authors tested on MNIST, CIFAR-10, SHVN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used and the Haar wavelet is used due to its even, square subbands. The network for all datasets except MNIST is loosedly based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window, and a consistent pooling method was used for all pooling layers of a network. The overall results teach us that the pooling method should be chosen specific to the type of data we have. In some cases wavelet pooling may perform the best, and in other cases, other methods may perform better, if the data is more suited for those types of pooling.

=== MNIST ===
Figure 7 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy from all pooling methods compared. Figure 8 shows the energy of each method per epoch.

[[File:WT_Fig4.PNG|650px|center|]]

[[File:paper21_fig8.png|800px|center]]

[[File:WT_Tab1.PNG|650px|center|]]

=== CIFAR-10 ===
In order to investigate the performance of different pooling methods, two types of networks are trained based on CIFAR-10. The first one is the regular CNN and the second one is the network with dropout and batch normalization. Figure 9 shows the network and Tables 2 and 3 shows the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive, while max pooling overfitted on the validation data fairly quickly as shown by the right energy curve in Figure 10 (although the accuracy performance is not significantly worse when dropout and batch normalization are applied).

[[File:WT_Fig5.PNG|650px|center|]]

[[File:paper21_fig10.png|800px|center]]

[[File:WT_Tab2.PNG|650px|center|]]

[[File:WT_Tab3.PNG|650px|center|]]

===SHVN===
Figure 11 shows the network and Tables 4 and 5 shows the accuracy without and with dropout. The proposed method does not perform well in this experiment.

[[File: a.png|650px|center|]]

[[File:paper21_fig12.png|800px|center]]

[[File: b.png|650px|center|]]

===KDEF===
The authors experimented with pooling methods + dropout on the KDEF dataset (which consists of 4,900 images of 35 people portraying varying emotions through facial expressions under different poses, 3,900 of which were randomly assigned to be used for training). The data was treated for errors (e.g. corrupt images) and resized to 128x128 for memory and time constraints.

Figure 13 below shows the network structure. Figure 14 shows the energy curve of the competing models on training and validation sets as the number of epochs increases, and Table 6 shows the accuracy performance. Average pooling demonstrated the highest accuracy, with wavelet pooling coming in second and max pooling a close third. However, stochastic and wavelet pooling exhibited more stable learning progression compared to the other methods, and max pooling eventually overfitted.

[[File:kdef_struc.PNG|700px|center|]]
[[File:kdef_curve.PNG|750px|center|]]
[[File:kdef_accu.PNG|550px|center|]]

== Computational Complexity ==
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.

[[File:WT_Tab4.PNG|650px|center|]]

== Criticism ==
=== Positive ===
* Wavelet Pooling achieves competitive performance with standard go to pooling methods
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)
=== Negative ===
* Only 2x2 pooling window used for comparison
* Highly computationally extensive
* Not as simple as other pooling methods
* Only one wavelet used (HAAR wavelet)

== References ==
Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.

Wavelet Pooling CNN

2018-03-20T18:55:45Z

Apon: /* Introduction */

== Introduction ==
Convolutional neural networks (CNN) have been proven to be powerful in image classification. Over the past few years researchers have put efforts in improving fundamental components of CNNs such as the pooling operation. Various pooling methods exist. Deterministic methods include max pooling and average pooling. Probabilistic methods include mixed pooling and stochastic pooling. All these methods employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).

This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation, uses a sub-band method that the authors claim produces less artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.

== Pooling Background ==
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data is reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. Max pooling and Mean/Average pooling are the 2 most commonly used pooling methods. For max pooling, this can be represented by the equation <math>a_{kij} = max_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^th</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \epsilon R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.

[[File:WT_Fig1.PNG|650px|center|]]

The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.

[[File:WT_Fig2.PNG|650px|center|]]

To account for the above mentioned issues, probabilistic pooling methods were introduced, namely mixed pooling and; stochastic pooling. Mixed pooling is a simple method which just combines the max and the average pooling by randomly selecting one method over the other during training. Stochastic pooling on the other hand randomly samples within a receptive field with the activation values as the probabilities. These are calculated by taking each activation value and dividing it by the sum of all activation values in the grid so that the probabilities sum to 1.

Figure 3 shows an example of how stochastic pooling works. On the left is a 3x3 grid filled with activations. The middle grid is the corresponding probability for each activation. The activation in the middle was randomly selected (it had a 13% chance of getting selected). Because the stochastic pooling is based on the probability of the pixels, it is able to avoid the shortcomings of max and mean pooling mentioned above.

[[File:paper21-stochasticpooling.png|650px|center|]]

== Wavelet Background ==
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well.

Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting.

[[File:WT_Fig3.jpg|650px|center|]]

The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.

== Discrete Wavelet Transform General==
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.

[[File:WT_Fig8.png|650px|center|]]

[[File:WT_Fig9.png|650px|center|]]

== DWT example using Haar Wavelet ==
Suppose we have an image represented by the following pixels:
<math> \begin{bmatrix}
100 & 50 & 60 & 150 \\
20 & 60 & 40 & 30 \\
50 & 90 & 70 & 82 \\
74 & 66 & 90 & 58 \\
\end{bmatrix} </math>

For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row
* a1 = (i1 + i2)/2
* a2 = (i3 + i4)/2
* d1 = (i1 - i2)/2
* d2 = (i3 - i4)/2

After the row transforms, the images looks as follows:
<math> \begin{bmatrix}
75 & 105 & 25 & -45 \\
40 & 35 & -20 & 5 \\
70 & 76 & -20 & -6 \\
70 & 74 & 4 & 16 \\
\end{bmatrix} </math>

Now we apply the same method to the columns in the exact same way.

== Proposed Method ==
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create less artifacts that may affect the image classification.
=== Forward Propagation ===
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k <= 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\psi[j,n]|_{n = 2k, k <= 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math>, <math>W_\psi</math>, are approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are time reversed scaling and wavelet vectors, <math>(n)</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply to images, FWT is first applied on the rows and then the columns. If a low (L) and high(H) sub-band is extracted from the rows and similarly for the columns than at each level there is 4 sub-bands (LH, HL, HH, and LL) where LL will further be decomposed into the level 2 decomposition.

Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k <= 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.

[[File:WT_Fig6.PNG|650px|center|]]

=== Back Propagation ===
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.

[[File:WT_Fig7.PNG|650px|center|]]

== Results ==
The authors tested on MNIST, CIFAR-10, SHVN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used and the Haar wavelet is used due to its even, square subbands. The network for all datasets except MNIST is loosedly based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window, and a consistent pooling method was used for all pooling layers of a network. The overall results teach us that the pooling method should be chosen specific to the type of data we have. In some cases wavelet pooling may perform the best, and in other cases, other methods may perform better, if the data is more suited for those types of pooling.

=== MNIST ===
Figure 7 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy from all pooling methods compared. Figure 8 shows the energy of each method per epoch.

[[File:WT_Fig4.PNG|650px|center|]]

[[File:paper21_fig8.png|800px|center]]

[[File:WT_Tab1.PNG|650px|center|]]

=== CIFAR-10 ===
In order to investigate the performance of different pooling methods, two types of networks are trained based on CIFAR-10. The first one is the regular CNN and the second one is the network with dropout and batch normalization. Figure 9 shows the network and Tables 2 and 3 shows the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive, while max pooling overfitted on the validation data fairly quickly as shown by the right energy curve in Figure 10 (although the accuracy performance is not significantly worse when dropout and batch normalization are applied).

[[File:WT_Fig5.PNG|650px|center|]]

[[File:paper21_fig10.png|800px|center]]

[[File:WT_Tab2.PNG|650px|center|]]

[[File:WT_Tab3.PNG|650px|center|]]

===SHVN===
Figure 11 shows the network and Tables 4 and 5 shows the accuracy without and with dropout. The proposed method does not perform well in this experiment.

[[File: a.png|650px|center|]]

[[File:paper21_fig12.png|800px|center]]

[[File: b.png|650px|center|]]

===KDEF===
The authors experimented with pooling methods + dropout on the KDEF dataset (which consists of 4,900 images of 35 people portraying varying emotions through facial expressions under different poses, 3,900 of which were randomly assigned to be used for training). The data was treated for errors (e.g. corrupt images) and resized to 128x128 for memory and time constraints.

Figure 13 below shows the network structure. Figure 14 shows the energy curve of the competing models on training and validation sets as the number of epochs increases, and Table 6 shows the accuracy performance. Average pooling demonstrated the highest accuracy, with wavelet pooling coming in second and max pooling a close third. However, stochastic and wavelet pooling exhibited more stable learning progression compared to the other methods, and max pooling eventually overfitted.

[[File:kdef_struc.PNG|700px|center|]]
[[File:kdef_curve.PNG|750px|center|]]
[[File:kdef_accu.PNG|550px|center|]]

== Computational Complexity ==
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.

[[File:WT_Tab4.PNG|650px|center|]]

== Criticism ==
=== Positive ===
* Wavelet Pooling achieves competitive performance with standard go to pooling methods
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)
=== Negative ===
* Only 2x2 pooling window used for comparison
* Highly computationally extensive
* Not as simple as other pooling methods
* Only one wavelet used (HAAR wavelet)

== References ==
Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-18T21:39:56Z

Apon:

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.

To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Critique ==

It seems clear that PointNet is lacking capturing local context between points. PointNet++ seems to be an important extension, but the improvements in the experimental results seem small. Some computational efficiency experiments would have been nice. For example, the processing speed of the network, and the computational efficiency of MRG over MRG.

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T19:09:52Z

Apon: /* Skip-connections */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.

To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T19:05:21Z

Apon: /* Grouping Layer */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.

To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T19:04:51Z

Apon: /* Grouping Layer */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.

To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:58:39Z

Apon: /* Problem Statement */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and output a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:57:42Z

Apon: /* Review of PointNet */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:52:28Z

Apon: /* Distance-based Interpolation */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:52:18Z

Apon: /* Point Cloud Segmentation */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>

[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:52:00Z

Apon:

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>

[[File:prop_feature.png | 700px|thumb|right|Feature interpolation during segmentation]]

=== Skip-connections ===

In addition, skip connections are used (see the Point++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:47:28Z

Apon: /* Point Cloud Segmentation */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.

=== Distance-based Interpolation ===

Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.

To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

File:prop feature.png

2018-03-17T18:47:24Z

Apon:

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:07:39Z

Apon:

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

== Code ==

Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:05:46Z

Apon: /* = Classification in Non-Euclidean Metric Space */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ===

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:05:36Z

Apon: /* Semantic Scene Labelling */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ==

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:05:25Z

Apon: /* = Classification in Non-Euclidean Metric Space */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 300px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ==

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:05:17Z

Apon: /* = Classification in Non-Euclidean Metric Space */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 300px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ==

[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.

[[File:shrec15_results.png | 300px|thumb|center|Results from the SHREC15 dataset.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

File:shrec15 results.png

2018-03-17T18:02:25Z

Apon:

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T18:00:54Z

Apon: /* Experiments */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 300px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=== Classification in Non-Euclidean Metric Space ==

Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. An example is shown below.

[[File:shrec.png | 300px|thumb|center|Example of shapes from the SHREC15 dataset.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

File:shrec.png

2018-03-17T18:00:22Z

Apon:

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T17:58:41Z

Apon: /* Semantic Scene Labelling */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=== Semantic Scene Labelling ===

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 300px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T17:58:17Z

Apon: /* Experiments */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

== Semantic Scene Labelling ==

The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.

[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]

To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.

[[File:scannet_acc.png | 300px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

File:scannet acc.png

2018-03-17T17:56:20Z

Apon:

File:scannet.png

2018-03-17T17:46:14Z

Apon:

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T17:43:34Z

Apon: /* Experiments */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T17:42:38Z

Apon:

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]

An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points was reduced to 256.

[[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

2018-03-17T17:38:29Z

Apon: /* Experiments */

= Introduction =
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.

[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]

Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:

# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points.

Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud.

[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]

= Review of PointNet =

The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created, that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.

The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.

[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]

= PointNet++ =

The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost.

== Problem Statement ==

There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input as outputs a a class or per point label to each member of <math>M</math>.

== Method ==

=== High Level Overview ===
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]

The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,

\begin{aligned}
\text{Input at each level: } N \times (d + c) \text{ matrix}
\end{aligned}

where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and

\begin{aligned}
\text{Output at each level: } N' \times (d + c') \text{ matrix}
\end{aligned}

where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.

Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.

=== Sampling Layer ===

The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.

To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.

=== Grouping Layer ===

The object of the grouping layer is to form local regions around each centroid by group points near the selected centroids. The input is a point set of size <math>N x (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.

Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that has the same size for all regions at the hierarchical level.

To determine which points belong to a group ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure,

=== PointNet Layer ===

After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.

=== Robust Feature Learning under Non-Uniform Sampling Density ===

The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.

The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points.

[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]

== Point Cloud Segmentation ==

If the task is segmentation, the architecture is slightly modified.

== Experiments ==
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.

=== Point Set Classification in Euclidean Metric Space ===

The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.

In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.

=Sources=
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017

2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017