Spherical CNNs

Introduction

Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part because they are translation equivariant: shifting the input shifts the feature maps by the same amount. This property allows a network trained to detect a certain type of object to still detect the object when it is translated to another position in the image. However, this property does not carry over to spherical signals, since projecting a spherical signal onto a plane introduces distortions, as demonstrated in Figure 1. There are many different projections of the sphere onto a 2D plane, as most people know from the various types of world maps, but none of them preserves all the properties needed for equivariant convolution of a spherical signal.


Notation

Below are listed several important terms:

  • Unit Sphere [math]\displaystyle{ S^2 }[/math] is the set of points in 3D space at distance 1 from the origin. The unit sphere can be parameterized by the spherical coordinates [math]\displaystyle{ \alpha ∈ [0, 2π] }[/math] and [math]\displaystyle{ β ∈ [0, π] }[/math]. It is a two-dimensional manifold with respect to [math]\displaystyle{ \alpha }[/math] and [math]\displaystyle{ β }[/math].
  • [math]\displaystyle{ S^2 }[/math] Sphere The two-dimensional surface of a three-dimensional ball.
  • Spherical Signals In the paper, spherical images and filters are modeled as continuous functions [math]\displaystyle{ f : S^2 → \mathbb{R}^K }[/math], where K is the number of channels. Just as an RGB image has 3 channels, a spherical signal can have several channels describing the data. Examples of the channels used can be found in the experiments section.
  • Rotations - SO(3) The group of 3D rotations, also called the "special orthogonal group".
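
As a concrete illustration of this notation, the following minimal sketch builds a point on the unit sphere from the coordinates (α, β) and an element of SO(3) from ZYZ-Euler angles. The function names are hypothetical and the axis conventions (azimuth α about Z, polar angle β from the Z axis) are an assumption made for the example, not something taken from the paper's code.

```python
import numpy as np

def rot_z(angle):
    """Rotation by `angle` about the Z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    """Rotation by `angle` about the Y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def zyz_rotation(alpha, beta, gamma):
    """Element of SO(3) as the matrix product Z(alpha) Y(beta) Z(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)

def sphere_point(alpha, beta):
    """Unit vector with azimuth alpha in [0, 2pi] and polar angle beta in [0, pi]."""
    return np.array([np.cos(alpha) * np.sin(beta),
                     np.sin(alpha) * np.sin(beta),
                     np.cos(beta)])

R = zyz_rotation(0.3, 1.1, -0.7)
x = sphere_point(2.0, 0.5)

# R is orthogonal with determinant +1, and it maps the sphere to itself.
print(np.allclose(R @ R.T, np.eye(3)), np.linalg.det(R), np.linalg.norm(R @ x))
```

Under this convention the point with coordinates (α, β) is the image of the north pole under the rotation with Euler angles (α, β, 0), which is one way to see why the sphere needs two angles and SO(3) needs three.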

Related Work

The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs. The authors list several papers which attempt to exploit symmetry groups larger than the translation group exploited by CNNs, but do not go into detail on what these are. They do state that all the previous works are limited to discrete groups, with the exception of SO(2)-steerable networks. The authors also mention that previous works exist that analyze spherical images, but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform to compute group correlations efficiently.

Correlations on the Sphere and Rotation Group

Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:

Planar correlation: The value of the output feature map at translation [math]\displaystyle{ \small x ∈ Z^2 }[/math] is computed as an inner product between the input feature map and a filter, shifted by [math]\displaystyle{ \small x }[/math].

Spherical correlation: The value of the output feature map evaluated at rotation [math]\displaystyle{ \small R ∈ SO(3) }[/math] is computed as an inner product between the input feature map and a filter, rotated by [math]\displaystyle{ \small R }[/math].

Rotation of Spherical Signals The paper introduces the rotation operator [math]\displaystyle{ L_R }[/math]. The rotation operator takes a function [math]\displaystyle{ f }[/math] and returns its rotated version, obtained by evaluating the function at [math]\displaystyle{ R^{-1} x }[/math], i.e. [math]\displaystyle{ [L_R f](x) = f(R^{-1} x) }[/math]. This is what allows us to rotate the spherical filters. With this definition we have the property that [math]\displaystyle{ L_{RR'} = L_R L_{R'} }[/math].
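
The composition property can be checked numerically with a few lines of code. This is only a sketch: the test signal and the helper names are hypothetical, and the rotation operator is implemented as evaluation of the function at the inversely rotated point.

```python
import numpy as np

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotate_signal(R, f):
    """Return L_R f, defined by [L_R f](x) = f(R^{-1} x)."""
    R_inv = R.T  # the inverse of a rotation matrix is its transpose
    return lambda x: f(R_inv @ x)

def f(x):
    # Arbitrary smooth test signal on the sphere (hypothetical example).
    return x[0] + 2.0 * x[1] * x[2] + x[2] ** 2

R1 = rot_z(0.4) @ rot_y(1.2)
R2 = rot_y(-0.9) @ rot_z(2.1)
x = np.array([0.6, 0.0, 0.8])  # a point on the unit sphere

lhs = rotate_signal(R1 @ R2, f)(x)                 # [L_{R1 R2} f](x)
rhs = rotate_signal(R1, rotate_signal(R2, f))(x)   # [L_{R1} L_{R2} f](x)
print(np.isclose(lhs, rhs))
```

The inverse in the definition is what makes the order come out right: applying the operator for R2 and then the one for R1 evaluates f at R2^{-1} R1^{-1} x, which is (R1 R2)^{-1} x.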

Inner Products The inner product of two spherical signals is the integral over the entire sphere of their product, summed over channels: [math]\displaystyle{ \langle \psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx }[/math]. Here [math]\displaystyle{ dx }[/math] is the standard rotation-invariant integration measure on the sphere, which is equivalent to [math]\displaystyle{ d \alpha \sin(\beta) d \beta / 4 \pi }[/math] in spherical coordinates. This comes from the ZYZ-Euler parameterization, where any rotation can be broken down into first a rotation about the Z axis, then a rotation about the new Y axis (Y'), followed by a rotation about the new Z axis (Z''). More details on this are given in Appendix A in the paper.

Because the measure is rotation invariant, the inner product is invariant under any rotation [math]\displaystyle{ R ∈ SO(3) }[/math]. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that the adjoint of [math]\displaystyle{ L_R }[/math] is [math]\displaystyle{ L_{R^{-1}} }[/math], and hence that [math]\displaystyle{ L_R }[/math] is unitary (it preserves orthogonality and distances).

[math]\displaystyle{ \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx }[/math]

[math]\displaystyle{ = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx }[/math]
[math]\displaystyle{ = \langle \psi , L_{R^{-1}} f \rangle }[/math]
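
The adjoint identity above can be verified numerically. The sketch below assumes a simple product quadrature for the normalized measure (Gauss-Legendre nodes in cos β, equispaced nodes in α), which is exact for the low-degree polynomial signals used here; the two-channel signals and all function names are hypothetical.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def sphere_quadrature(n=16):
    """Points and weights for the measure d(alpha) sin(beta) d(beta) / (4 pi)."""
    u, wu = np.polynomial.legendre.leggauss(n)        # u = cos(beta)
    alpha = np.pi * np.arange(2 * n) / n              # 2n equispaced azimuths
    A, U = np.meshgrid(alpha, u, indexing="ij")
    s = np.sqrt(1.0 - U ** 2)
    X = np.stack([s * np.cos(A), s * np.sin(A), U], axis=-1).reshape(-1, 3)
    w = np.tile(wu, 2 * n) / (4.0 * n)                # weights sum to 1
    return X, w

def psi(X):
    # Two-channel polynomial filter (hypothetical); rows of X are points.
    x, y, z = X.T
    return np.stack([x * z + y, z ** 2], axis=-1)

def f(X):
    # Two-channel polynomial signal (hypothetical).
    x, y, z = X.T
    return np.stack([x + y * z, x * y - z], axis=-1)

def inner(g_vals, h_vals, w):
    """Weighted sum over points and channels, approximating <g, h>."""
    return np.sum(w[:, None] * g_vals * h_vals)

R = rot_z(0.7) @ rot_y(1.3) @ rot_z(-0.2)
X, w = sphere_quadrature()

# For row vectors, R^{-1} x is X @ R and R x is X @ R.T.
lhs = inner(psi(X @ R), f(X), w)      # <L_R psi, f>
rhs = inner(psi(X), f(X @ R.T), w)    # <psi, L_{R^{-1}} f>
print(np.isclose(lhs, rhs))
```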

Spherical Correlation With the above knowledge the definition of spherical correlation of two signals [math]\displaystyle{ f }[/math] and [math]\displaystyle{ \psi }[/math] is:

[math]\displaystyle{ [\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx }[/math]

The output of the above equation is a function on SO(3). This can be thought of as follows: for each rotation, given by a combination of Euler angles [math]\displaystyle{ \alpha , \beta , \gamma }[/math], the correlation takes a (generally different) value. The authors make a point of noting that the earlier definition of spherical convolution by Driscoll and Healy is limited to filters that are circularly symmetric about the Z axis, whereas their new formulation places no such restriction on the filter.
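
A brute-force sketch makes the "function on SO(3)" point concrete. It evaluates the spherical correlation on a small grid of ZYZ-Euler angles by direct quadrature (not the fast algorithm used in the paper), once for a general filter and once for a filter that is circularly symmetric about the Z axis. The signals, grid sizes and function names are hypothetical choices for the example.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def sphere_quadrature(n=8):
    """Points and weights for the measure d(alpha) sin(beta) d(beta) / (4 pi)."""
    u, wu = np.polynomial.legendre.leggauss(n)
    alpha = np.pi * np.arange(2 * n) / n
    A, U = np.meshgrid(alpha, u, indexing="ij")
    s = np.sqrt(1.0 - U ** 2)
    X = np.stack([s * np.cos(A), s * np.sin(A), U], axis=-1).reshape(-1, 3)
    return X, np.tile(wu, 2 * n) / (4.0 * n)

def f(X):
    x, y, z = X.T
    return np.stack([x + y * z, x * y - z], axis=-1)

def psi_general(X):
    x, y, z = X.T
    return np.stack([x * z + y, z ** 2], axis=-1)

def psi_zonal(X):
    z = X[:, 2]                      # depends on z only: symmetric about Z
    return np.stack([z, z ** 2], axis=-1)

def correlation_map(psi, n_alpha=4, n_beta=3, n_gamma=4):
    """[psi * f](R) for R = Z(alpha) Y(beta) Z(gamma) on a small angle grid."""
    X, w = sphere_quadrature()
    alphas = np.linspace(0.0, 2.0 * np.pi, n_alpha, endpoint=False)
    betas = np.linspace(0.0, np.pi, n_beta)
    gammas = np.linspace(0.0, 2.0 * np.pi, n_gamma, endpoint=False)
    out = np.empty((n_alpha, n_beta, n_gamma))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            for k, g in enumerate(gammas):
                R = rot_z(a) @ rot_y(b) @ rot_z(g)
                # psi evaluated at R^{-1} x; for row vectors this is X @ R
                out[i, j, k] = np.sum(w[:, None] * psi(X @ R) * f(X))
    return out

out_general = correlation_map(psi_general)
out_zonal = correlation_map(psi_zonal)
print(out_general.shape)
print(np.allclose(out_zonal, out_zonal[:, :, :1]))   # zonal filter: no gamma dependence
```

With the Z-symmetric filter the output does not depend on γ, so it is effectively a function of (α, β) only, i.e. a function on the sphere. With the general filter all three angles matter, which is why the output has to live on SO(3).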

Rotation of SO(3) Signals The first layer of a Spherical CNN takes a function on the sphere ([math]\displaystyle{ S^2 }[/math]) and outputs a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built, there needs to be a way to compute the correlation between two signals on SO(3). The authors therefore generalize the rotation operator ([math]\displaystyle{ L_R }[/math]) to act on signals on SO(3). This new definition of [math]\displaystyle{ L_R }[/math] is as follows: (where [math]\displaystyle{ R^{-1}Q }[/math] is a composition of rotations, i.e. multiplication of rotation matrices)

[math]\displaystyle{ [L_Rf](Q)=f(R^{-1} Q) }[/math]

Rotation Group Correlation Using this operator, the correlation of two signals ([math]\displaystyle{ f,\psi }[/math]) on SO(3) with K channels is defined as follows:

[math]\displaystyle{ [\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ }[/math]

where dQ is the invariant integration measure on SO(3), which in ZYZ-Euler angles is [math]\displaystyle{ d \alpha \sin(\beta) d \beta d \gamma / 8 \pi^2 }[/math]. A complete derivation of this can be found in Appendix A.
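
As a quick sanity check on the normalization constant, integrating the measure over the full range of Euler angles gives total volume 1. This assumes the usual ranges α, γ ∈ [0, 2π] and β ∈ [0, π] (the range of γ is not stated above and is an assumption here):

[math]\displaystyle{ \int_{SO(3)} dQ = \frac{1}{8 \pi^2} \int_0^{2\pi} d\alpha \int_0^{\pi} \sin(\beta) d\beta \int_0^{2\pi} d\gamma = \frac{2\pi \cdot 2 \cdot 2\pi}{8 \pi^2} = 1 }[/math]

The same calculation without the γ integral gives [math]\displaystyle{ (2\pi \cdot 2) / 4\pi = 1 }[/math] for the measure on the sphere.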

The authors show that the rotation group correlation is equivariant, using the same argument as for the spherical correlation.
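
The argument can be spelled out in one line. It uses only the adjoint property of [math]\displaystyle{ L_R }[/math] and the composition rule [math]\displaystyle{ L_{RR'} = L_R L_{R'} }[/math] stated earlier; here Q denotes an arbitrary rotation applied to the input signal:

[math]\displaystyle{ [\psi \star [L_Q f]](R) = \langle L_R \psi , L_Q f \rangle = \langle L_{Q^{-1}} L_R \psi , f \rangle = \langle L_{Q^{-1}R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q [\psi \star f]](R) }[/math]

In words: rotating the input by Q and then correlating gives the same result as correlating first and then rotating the output feature map by Q.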

Efficient Implementation

The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem, which states that the Fourier transform of a correlation is the product of the Fourier transforms of its two arguments (one of them conjugated). The Fourier transform itself expresses a periodic function as a sum of sine and cosine terms, whose weights are called Fourier coefficients.

The Fourier transform can be generalized to [math]\displaystyle{ S^2 }[/math] and SO(3), where it is called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are matrix elements of the irreducible unitary representations of the group. For [math]\displaystyle{ S^2 }[/math] the basis functions are the spherical harmonics [math]\displaystyle{ Y_m^l(x) }[/math]. For SO(3) these basis functions are called the Wigner D-functions [math]\displaystyle{ D_{mn}^l(R) }[/math]. For both functions the indices are restricted to [math]\displaystyle{ l\geq0 }[/math] and [math]\displaystyle{ -l \leq m,n \leq l }[/math].

The Wigner D-functions are orthogonal, so the Fourier coefficients can be computed by taking the inner product with the Wigner D-functions (see Appendix C). The Wigner D-functions are also complete, which means that any well-behaved function on SO(3) can be expressed as a linear combination of the Wigner D-functions. Therefore, the inverse SO(3) Fourier transform is: (where [math]\displaystyle{ \hat{f} }[/math] represents the Fourier coefficients and b is the bandwidth, which is related to the resolution of the spatial grid)

[math]\displaystyle{ f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f}_{mn}^l D_{mn}^l(R) }[/math]

The authors show (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the [math]\displaystyle{ S^2 }[/math] correlation of spherical signals can be computed by the outer products of the [math]\displaystyle{ S^2 }[/math]-FTs (Shown in Figure 2).
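
The idea behind the Fourier theorem is easiest to see in the familiar planar case. The following sketch is only a 1D circular analogue using NumPy's ordinary FFT, not the generalized transform on [math]\displaystyle{ S^2 }[/math] or SO(3): it computes a circular cross-correlation directly and through the Fourier domain and checks that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
f = rng.standard_normal(n)     # input signal on a circle of n points
psi = rng.standard_normal(n)   # filter

# Direct circular cross-correlation, O(n^2):
# out[x] = sum_y psi[y - x] * f[y], i.e. the filter shifted by x.
direct = np.array([np.sum(np.roll(psi, x) * f) for x in range(n)])

# Same result via the Fourier theorem, O(n log n):
# the transform of the correlation is F[k] * conj(P[k]).
fast = np.fft.ifft(np.fft.fft(f) * np.conj(np.fft.fft(psi))).real

print(np.allclose(direct, fast))
```

On SO(3) the pointwise product of coefficients is replaced by a product of coefficient matrices for each degree l, but the structure of the computation (transform, multiply, inverse transform) is the same.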

The authors do not provide any run time comparisons for real time applications or any comparisons on training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn

Experiments

The authors provide several experiments. The first set of experiments is designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.

Equivariance Error

In this experiment the authors test empirically whether their theory of equivariance holds in the discretized implementation. Equivariance is exact in the continuous case, but the authors note that discretization artifacts could break it in practice. The experiment tests the equivariance of the SO(3) correlation at different resolutions: 500 random rotations and feature maps (with 10 channels) are sampled, and the authors then calculate

[math]\displaystyle{ \small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/std(\Phi(f_i)) }[/math]

Note: The authors do not say what the std function is; it is most likely the standard deviation, since 'std' is the command for standard deviation in MATLAB. [math]\displaystyle{ \Phi }[/math] is a composition of SO(3) correlation layers with randomly initialized filters. [math]\displaystyle{ \Delta }[/math] would be zero in the case of perfect equivariance, because, as proven earlier, the two terms [math]\displaystyle{ \small L_{R_i} \Phi(f_i) }[/math] and [math]\displaystyle{ \small \Phi(L_{R_i} f_i) }[/math] are equal in the continuous case. The results are shown in Figure 3.

Without an activation function, [math]\displaystyle{ \Delta }[/math] grows with the resolution and with the number of layers. With ReLU activations the error stays roughly constant across resolutions. The authors conclude that the error must therefore come from the feature map rotation rather than from the network layers, since the feature map rotation is exact only for bandlimited functions.
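
To make the metric concrete, here is a minimal sketch of how [math]\displaystyle{ \Delta }[/math] could be computed. It is a toy stand-in, not the authors' experiment: circular shifts on a 1D grid play the role of rotations, FFT-based circular correlation layers play the role of SO(3) correlation layers, and std is taken to be the standard deviation. All names are hypothetical.

```python
import numpy as np

def correlation_layer(f, psi):
    """Multi-channel circular correlation: f is (K_in, n), psi is (K_out, K_in, n)."""
    F = np.fft.fft(f, axis=-1)
    P = np.fft.fft(psi, axis=-1)
    return np.fft.ifft(np.einsum("kn,okn->on", F, np.conj(P)), axis=-1).real

def make_network(filters, relu=False):
    """Phi: a composition of correlation layers, optionally with ReLU after each."""
    def phi(f):
        for psi in filters:
            f = correlation_layer(f, psi)
            if relu:
                f = np.maximum(f, 0.0)
        return f
    return phi

def equivariance_error(phi, shifts, signals):
    """Delta = mean_i std(L_i Phi(f_i) - Phi(L_i f_i)) / std(Phi(f_i))."""
    ratios = []
    for s, f in zip(shifts, signals):
        out = phi(f)
        diff = np.roll(out, s, axis=-1) - phi(np.roll(f, s, axis=-1))
        ratios.append(np.std(diff) / np.std(out))
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
n_samples, channels, length = 500, 10, 32
filters = [rng.standard_normal((channels, channels, length)) / length for _ in range(2)]
signals = rng.standard_normal((n_samples, channels, length))
shifts = rng.integers(0, length, size=n_samples)

delta_linear = equivariance_error(make_network(filters), shifts, signals)
delta_relu = equivariance_error(make_network(filters, relu=True), shifts, signals)
print(delta_linear, delta_relu)
```

In this toy setting the shifts are exact on the grid, so both values are at the level of floating-point rounding. On the sphere the rotation of a sampled feature map is not exact, which is the source of the nonzero error discussed above.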


MNIST Data

The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.
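
The geometric step can be sketched as follows. The summary does not give the details of the projection (scaling of the image plane, choice of pole, interpolation onto the spherical grid), so the choices below are assumptions made for illustration: the image is scaled to the square [-1, 1] x [-1, 1] and mapped to the sphere by inverse stereographic projection from the north pole.

```python
import numpy as np

def inverse_stereographic(u, v):
    """Map plane coordinates (u, v) to the unit sphere, projecting from the north pole."""
    r2 = u ** 2 + v ** 2
    d = r2 + 1.0
    return np.stack([2.0 * u / d, 2.0 * v / d, (r2 - 1.0) / d], axis=-1)

# Pixel centres of a 28x28 image, scaled to [-1, 1] x [-1, 1] (assumed scaling).
coords = np.linspace(-1.0, 1.0, 28)
U, V = np.meshgrid(coords, coords, indexing="xy")
points = inverse_stereographic(U, V)   # shape (28, 28, 3), one sphere point per pixel

print(points.shape, np.allclose(np.linalg.norm(points, axis=-1), 1.0))
```

With this choice the centre of the image lands on the south pole and the digit covers a cap around it. Producing the actual spherical signal would additionally require resampling the pixel values onto the spherical grid, which is not shown here.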

The authors created two datasets, one with the projected digits and the other with the same projected digits subjected to a random rotation. The spherical CNN architecture used was [math]\displaystyle{ \small S^2 }[/math]conv-ReLU-SO(3)conv-ReLU-FC-softmax, with bandwidths of 30, 10, 6 and 20, 40, 10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32, 64, 10 channels and a stride of 3. For comparison, this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant: performance on the rotated set is almost identical to performance on the non-rotated set, even when the network is trained on the non-rotated set. Compare this to the non-spherical architecture, which becomes unusable when the digits are rotated.
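
The baseline's parameter count can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated above: a single-channel 60x60 input (a 2b x 2b grid for bandwidth b = 30), no padding, and a bias term in every layer.

```python
def conv_output_size(size, kernel=5, stride=3):
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1

def conv_params(c_in, c_out, kernel=5):
    """Weights plus biases of a 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

input_size = 60                              # assumed: 2b x 2b grid with b = 30
size1 = conv_output_size(input_size)         # after the first conv
size2 = conv_output_size(size1)              # after the second conv

conv1 = conv_params(1, 32)                   # 1 input channel -> 32 channels
conv2 = conv_params(32, 64)                  # 32 -> 64 channels
fc = 64 * size2 * size2 * 10 + 10            # flattened features -> 10 classes

total = conv1 + conv2 + fc
print(size1, size2, total)   # 19 5 68106
```

Under these assumptions the total comes to roughly 68K, consistent with the figure quoted for the baseline.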


SHREC17

The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.



The network architecture used is an initial [math]\displaystyle{ \small S^2 }[/math]conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.

This architecture achieves results close to the state of the art on the SHREC17 tasks. The model would place 2nd or 3rd in all categories, but it was not submitted because the 2017 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. The authors claim the results are empirical evidence of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures in the SHREC17 competition are highly specialized, whereas their model is fairly general.


Molecular Atomization

In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. The authors use the QM7 dataset, where the task is to predict the atomization energy of molecules. The positions and charges given in the dataset are projected onto the sphere using potential functions. A summary of the results is shown in Table 3, along with some of the spherical CNN architecture details. It shows the RMSE obtained by the different methods. The results from this final experiment are also promising, as the network the authors present achieves the second best score. They also note that the cost of the first-place method grows exponentially with the number of atoms per molecule, so it is unlikely to scale well.

Conclusions

This paper presents a novel architecture called Spherical CNNs. The paper defines [math]\displaystyle{ \small S^2 }[/math] and SO(3) cross correlations, shows the theory behind their rotational equivariance, and demonstrates that the equivariance also holds approximately in the discrete case. An effective Generalized FFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results.

For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which is potentially why they did not conduct any experiments in this area.

Commentary

The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists previous methods instead of describing them, and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected, as the paper is just the introduction of spherical CNNs and more work is sure to come.

Source Code

Source code is available at: https://github.com/jonas-koehler/s2cnn

Sources

1. T. Cohen et al. Spherical CNNs, 2018.