Wavelet Pooling CNN
Convolutional Neural Networks (CNNs) generally outperform vector-based deep learning techniques, which makes their fundamental building blocks good candidates for innovation aimed at improving performance further. The pooling layer is one of these building blocks. Existing methods range from simple and deterministic (max pooling and average pooling) to probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest neighbor interpolation, the method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. If pooling is viewed as a lossy process, the reason for employing a wavelet approach is to minimize this loss.
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced, with the goal of compressing the data rather than simply discarding some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one. Max pooling can be represented by the equation $a_{kij} = \max_{(p,q) \in R_{ij}} a_{kpq}$, where $a_{kij}$ is the output activation of the $k$th feature map at $(i,j)$, $a_{kpq}$ is the input activation at $(p,q)$ within $R_{ij}$, and $R_{ij}$ is the pooling region, whose size is $|R_{ij}|$. Mean pooling can be represented by the equation $a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}$, with everything defined as before. Figure 1 provides a numerical example that can be followed.
[insert figure 1]
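The two pooling rules can also be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the function name pool2d and the 4x4 feature map are made up for the example, and it assumes non-overlapping windows on a single 2D feature map.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling of one 2D feature map with a size x size window.

    pool2d is a hypothetical helper written for this illustration.
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "dimensions must be divisible by size"
    # Reshape so that every pooling region R_ij occupies axes 1 and 3.
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # a_kij = max over R_ij
    if mode == "mean":
        return blocks.mean(axis=(1, 3))  # a_kij = (1 / |R_ij|) * sum over R_ij
    raise ValueError(f"unknown mode: {mode}")

# Made-up 4x4 feature map, pooled with a 2x2 window.
feature_map = np.array([[1., 3., 2., 0.],
                        [5., 7., 4., 6.],
                        [9., 1., 8., 2.],
                        [3., 5., 0., 4.]])

max_out = pool2d(feature_map, mode="max")    # [[7, 6], [9, 8]]
mean_out = pool2d(feature_map, mode="mean")  # [[4, 3], [4.5, 3.5]]
```

Both outputs are 2x2: each 2x2 region of the input collapses to a single value, either its largest entry or its average.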
The paper notes that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have lower intensity than the insignificant ones, and it is also prone to overfitting. Average pooling, on the other hand, can dilute important features if they are averaged with values of significantly lower intensity. Figure 2 illustrates this.
[insert figure 2]
Signals tend to be composed of slowly changing trends (low frequency) as well as fast-changing transients (high frequency). Similarly, images have smooth regions of intensity that are interrupted by edges or abrupt changes. These abrupt changes can represent features that are of great importance in deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents a signal as a sum of sine waves that oscillate forever and are therefore not localized in time or space. This localization is what makes wavelets well suited to detecting abrupt changes in an image.
Essentially, a wavelet is a fast-decaying, oscillating signal with zero mean that exists only for a finite duration and can be scaled and shifted in time. Several well-defined families of wavelets are shown in Figure 3. The key property of wavelets for our purposes is that they act as band-pass filters: the band can be adjusted by scaling, while shifting selects the location being analyzed.
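The localization argument can be checked on a toy signal. The sketch below is an illustration only, not taken from the paper. It applies one level of the orthonormal Haar transform to a made-up 1D step signal and compares the number of non-zero detail coefficients with the number of non-zero Fourier coefficients. The helper name haar_dwt_1d and the signal are assumptions for the example.

```python
import numpy as np

def haar_dwt_1d(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail).

    haar_dwt_1d is a hypothetical helper written for this illustration.
    """
    s = np.asarray(signal, dtype=float)
    assert s.size % 2 == 0, "signal length must be even"
    even, odd = s[0::2], s[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-pass: local averages
    detail = (even - odd) / np.sqrt(2.0)  # high-pass: local differences
    return approx, detail

# Made-up step signal with a single abrupt change between samples 2 and 3.
step = np.array([0., 0., 0., 1., 1., 1., 1., 1.])

approx, detail = haar_dwt_1d(step)
# Only the sample pair that straddles the jump has a non-zero detail coefficient.
edge_locations = np.flatnonzero(np.abs(detail) > 1e-12)

# The same jump spreads over every Fourier coefficient of the signal.
fourier_nonzero = np.count_nonzero(np.abs(np.fft.rfft(step)) > 1e-12)
```

Here the single edge shows up in exactly one detail coefficient, whereas all five coefficients returned by the real FFT are non-zero. Note that with a single Haar level, a jump that falls exactly between two sample pairs would not appear in the detail coefficients of that level.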
[Insert figure 3]
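As a concrete case of the definition above, the Haar wavelet can be written down and checked numerically. The sketch below is illustrative only. It assumes the common normalization psi_{a,b}(t) = psi((t - b) / a) / sqrt(a), and the function names, grid, scale, and shift are chosen arbitrarily for the example.

```python
import numpy as np

def haar_mother(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where((t >= 0.0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def scaled_shifted(t, scale, shift):
    """psi_{a,b}(t) = psi((t - b) / a) / sqrt(a), with a = scale and b = shift."""
    return haar_mother((np.asarray(t, dtype=float) - shift) / scale) / np.sqrt(scale)

# Uniform grid on [-2, 6) with step 0.001 (built from integers to avoid rounding drift).
t = np.arange(-2000, 6000) / 1000.0
dt = 0.001

# Stretch the wavelet by 2 and move it to start at t = 1, so it lives on [1, 3).
psi = scaled_shifted(t, scale=2.0, shift=1.0)

mean_value = psi.sum() * dt      # Riemann-sum estimate of the integral of psi
energy = (psi ** 2).sum() * dt   # estimate of the integral of psi squared
support = t[psi != 0]            # grid points where the wavelet is non-zero
```

The integral of the scaled and shifted wavelet is zero, its energy is one under this normalization, and it is non-zero only on a finite interval, matching the zero-mean, finite-duration description.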
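The paper uses the discrete wavelet transform (DWT), and more specifically a faster variant called the Fast Wavelet Transform (FWT). A continuous wavelet transform (CWT) also exists. The main difference between the two is how the scale and shift parameters are selected: the CWT lets them vary continuously, while the DWT restricts them to a discrete (typically dyadic) grid.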
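For reference, the two transforms can be written out in standard textbook notation. These are the usual definitions rather than formulas quoted from the paper, and the symbols $W_x$, $a$, $b$, $j$, and $k$ are introduced here only for illustration.

```latex
% Continuous wavelet transform: scale a and shift b vary continuously.
W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\,
            \psi^{*}\!\left(\frac{t - b}{a}\right) dt,
\qquad a \in \mathbb{R}^{+},\; b \in \mathbb{R}

% Discrete wavelet transform: scale and shift are sampled on a dyadic grid.
a = 2^{j}, \quad b = k\,2^{j}, \qquad
\psi_{j,k}(t) = 2^{-j/2}\, \psi\!\left(2^{-j} t - k\right),
\qquad j, k \in \mathbb{Z}
```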
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used for training, and the Haar wavelet was chosen for its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). For each dataset, the authors keep the network fixed and vary only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window.
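To make the data flow of wavelet pooling concrete, here is a rough NumPy sketch with the Haar wavelet. It rests on one reading of the method, stated as an assumption: decompose the feature map to the second level, then apply one inverse transform to the second-level subbands only, which yields an output with half the original height and width. The function names, the subband labels (conventions for LH and HL vary), and the test input are all hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D orthonormal Haar DWT. Returns (LL, LH, HL, HH)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # approximation subband
    lh = (a - b + c - d) / 2.0            # detail: differences across columns
    hl = (a + b - c - d) / 2.0            # detail: differences across rows
    hh = (a - b - c + d) / 2.0            # detail: diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds an array twice the size of each subband."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_pool(x):
    """Assumed wavelet pooling: 2-level decomposition, then 1-level reconstruction."""
    assert x.shape[0] % 4 == 0 and x.shape[1] % 4 == 0, "dimensions must be divisible by 4"
    ll1, _, _, _ = haar_dwt2(x)            # level 1: keep only the approximation
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)    # level 2: decompose it again
    return haar_idwt2(ll2, lh2, hl2, hh2)  # rebuild from the level-2 subbands only

# Made-up 8x8 feature map; the pooled output is 4x4.
feature_map = np.arange(64, dtype=float).reshape(8, 8)
pooled = wavelet_pool(feature_map)
```

Because the Haar transform reconstructs perfectly, the output of this sketch equals the first-level approximation subband. It is meant to show the order of operations, not to reproduce the paper's results.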