Depthwise Convolution Is All You Need for Learning Multiple Visual Domains


Presented by

Yuwei Liu, Daniel Mao

Introduction

This paper proposes a multi-domain learning architecture based on depthwise separable convolution. The approach rests on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and adds minimal overhead when applied to new domains. Additionally, the authors introduce a gating mechanism to promote soft sharing between different domains. The approach was evaluated on the Visual Decathlon Challenge, where it achieved the highest score while requiring only 50% of the parameters of state-of-the-art approaches.

Motivation

Can we build a single neural network that can deal with images across different domains? This question motivates the concept of "multi-domain learning", which poses two main challenges: (1) identifying a common structure among different domains, and (2) adding new tasks to the model without introducing additional parameters.

Previous Work

1. Multi-Domain Learning aims at creating a single neural network to perform image classification tasks in a variety of domains. (Bilen and Vedaldi 2017) showed that a single neural network can simultaneously learn several different visual domains by using an instance normalization layer. (Rebuffi, Bilen, and Vedaldi 2017; 2018) proposed universal parametric families of neural networks that contain specialized problem-specific models which differ only by a small number of parameters. (Rosenfeld and Tsotsos 2018) proposed a method called Deep Adaptation Networks (DAN) that constrains newly learned filters for new domains to be linear combinations of existing ones.

2. Multi-Task Learning (Doersch and Zisserman 2017; Kokkinos 2017) extracts different features from a single input to simultaneously perform classification, object recognition, edge detection, etc. Various applications can benefit from a multi-task learning approach since the training signals can be reused among related tasks (Caruana 1997; Zamir et al. 2018).

3. Transfer Learning aims to improve the performance of a model on a target domain by leveraging information from a related source domain (Pan, Yang, and others 2010; Bengio 2012; Hu, Lu, and Tan 2015). Transfer learning has wide applications in a variety of areas, such as computer vision (Raina et al. 2007), sentiment analysis (Glorot, Bordes, and Bengio 2011), and recommender systems (Pan et al. 2010; Guo, Wang, and Xu 2015).

Model Architecture

Our proposed approach is based on depthwise separable convolution, which factorizes a standard 3 × 3 convolution into a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution. While standard convolution performs the channel-wise and spatial-wise computation in one step, depthwise separable convolution splits the computation into two steps: depthwise convolution applies a single convolutional filter to each input channel, and pointwise convolution is then used to create a linear combination of the outputs of the depthwise convolution. The comparison of standard convolution and depthwise separable convolution is shown in Fig. 3. Consider applying a standard convolutional filter K of size W × W × M × N (spatial size W, with M input channels and N output channels) to an input feature map F with M channels.

[[File:Standard_convolution_and_depthwise_separable_convolution.png|200px|thumb|left|Standard convolution and depthwise separable convolution]]

Depthwise convolution and pointwise convolution have different roles in generating new features: the former is used for capturing spatial correlations while the latter is used for capturing channel-wise correlations.
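As an illustration, here is a minimal PyTorch sketch of a depthwise separable convolution (the module name and channel sizes are our own, not from the paper): a depthwise 3 × 3 convolution implemented via the groups argument, followed by a 1 × 1 pointwise convolution.

<pre>
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_channels),
        # capturing spatial correlations within each channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Pointwise: 1x1 convolution forming linear combinations of the
        # depthwise outputs, capturing cross-channel correlations.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Example: a standard 3x3 conv with M=64 inputs and N=128 outputs uses
# 3*3*64*128 = 73,728 weights; the separable version uses only
# 3*3*64 + 64*128 = 8,768 weights.
layer = DepthwiseSeparableConv(64, 128)
out = layer(torch.randn(1, 64, 32, 32))  # -> shape (1, 128, 32, 32)
</pre>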

Network Architecture. For the experiments, we use the same ResNet-26 architecture as in (Rebuffi, Bilen, and Vedaldi 2018), which allows us to fairly compare the performance of the proposed approach with previous ones. The original architecture has three macro residual blocks, outputting 64, 128, and 256 feature channels respectively. Each macro block consists of 4 residual blocks, and each residual block has two convolutional layers with 3 × 3 convolutional filters. The network ends with a global average pooling layer and a softmax layer for classification. Different from (Rebuffi, Bilen, and Vedaldi 2018), each standard convolution in the ResNet-26 is replaced with a depthwise separable convolution, and the channel size is increased. The modified network architecture is shown in Fig. 2. This choice leads to a more compact model while still maintaining enough network capacity: the original ResNet-26 has over 6M parameters, while the modified architecture has only half as many. In the experiments we found that this reduction of parameters does no harm to the performance of the model. The use of depthwise separable convolution allows us to model cross-channel correlations and spatial correlations separately. The idea behind our multi-domain learning method is to leverage the different roles of cross-channel correlations and spatial correlations in generating image features by sharing the pointwise convolution across different domains.
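A hedged sketch of how one such residual block might look with the standard convolutions swapped for depthwise separable ones, reusing the DepthwiseSeparableConv module above (our own reconstruction; the paper's exact block layout, e.g. the placement of batch normalization and the shortcut projection, may differ):

<pre>
import torch.nn.functional as F

class SeparableResidualBlock(nn.Module):
    """Residual block with two depthwise separable 3x3 convolutions."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = DepthwiseSeparableConv(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = DepthwiseSeparableConv(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 projection on the shortcut when the shape changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))
</pre>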

Learning Multiple Domains. For multi-domain learning, it is essential to have a set of universally sharable parameters that can generalize to unseen domains. To get a good starting set of parameters, we first train the modified ResNet-26 on ImageNet. After we obtain a well-initialized network, each time a new domain arrives we add a new output layer and finetune the depthwise convolutional filters. The pointwise convolutional filters are shared across different domains. Since the statistics of the images from different domains are different, we also allow domain-specific batch normalization parameters. During inference, the trained depthwise convolutional filters for all domains are stacked as a 4D tensor, and the output for a domain d is computed by applying that domain's depthwise filters together with the shared pointwise filters.

The adoption of depthwise separable convolution provides a natural separation for modeling cross-channel correlations and spatial correlations. Experimental evidence (Chollet 2017) suggests that decoupling cross-channel correlations and spatial correlations results in more useful features. We take this one step further and develop a multi-domain learning method based on the assumption that different domains share cross-channel correlations but have domain-specific spatial correlations. Our method is based on two observations: model efficiency and the interpretability of hidden units in a deep neural network.
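The mechanism can be sketched as follows (a minimal reconstruction under the assumptions described above, not the authors' code; it reuses the imports from the earlier sketches): each domain gets its own depthwise filters and batch-normalization parameters, while the single pointwise convolution is shared.

<pre>
class MultiDomainSeparableConv(nn.Module):
    """Domain-specific depthwise filters + shared pointwise convolution."""
    def __init__(self, in_channels, out_channels, num_domains):
        super().__init__()
        # One depthwise 3x3 conv (and BN) per domain: domain-specific
        # spatial correlations and image statistics.
        self.depthwise = nn.ModuleList([
            nn.Conv2d(in_channels, in_channels, 3, padding=1,
                      groups=in_channels, bias=False)
            for _ in range(num_domains)])
        self.bn = nn.ModuleList([
            nn.BatchNorm2d(in_channels) for _ in range(num_domains)])
        # A single pointwise conv shared by all domains: cross-channel
        # correlations are assumed common across domains.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x, domain):
        out = self.bn[domain](self.depthwise[domain](x))
        return self.pointwise(out)
</pre>

When a new domain arrives, only a new depthwise/BN entry (plus an output layer) needs to be trained, so the per-domain overhead is small relative to the shared pointwise parameters.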

Experiments

The authors conducted two experiments to test the theory: the first was based on simulation, and the second used the CIFAR10 dataset.

Two activations are used in the simulation: well-specified under-parametrized logistic regression, and a general convex ERM with the under-confident activation [math]\displaystyle{ \sigma_{underconf} }[/math]. Calibration curves were plotted for both activations: the x-axis is the predicted probability p, and the y-axis is the average true probability given the prediction.
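As a hedged illustration of this kind of simulation (our own construction, not the paper's code; the dimensions and sample sizes are arbitrary), the following sketch fits a well-specified logistic regression on synthetic Gaussian data and bins predictions to form a calibration curve:

<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 5000, 100                      # sample size and dimension (kappa = d/n)
w_star = rng.normal(size=d) / np.sqrt(d)

# Well-specified data: labels drawn from the true logistic model.
X = rng.normal(size=(n, d))
p_true = 1.0 / (1.0 + np.exp(-X @ w_star))
y = (rng.random(n) < p_true).astype(int)

# Fit unregularized logistic regression and get predicted probabilities
# on a fresh test set drawn from the same model.
clf = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
X_test = rng.normal(size=(n, d))
p_test = 1.0 / (1.0 + np.exp(-X_test @ w_star))
y_test = (rng.random(n) < p_test).astype(int)
p_hat = clf.predict_proba(X_test)[:, 1]

# Calibration curve: average true label frequency within bins of p_hat.
bins = np.linspace(0.5, 1.0, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_hat >= lo) & (p_hat < hi)
    if mask.any():
        # Over-confidence shows up as mean(y) falling below the bin midpoint.
        print(f"p_hat in [{lo:.2f}, {hi:.2f}): mean(y) = {y_test[mask].mean():.3f}")
</pre>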

The figure above shows four main results. First, logistic regression is over-confident at all [math]\displaystyle{ \kappa }[/math]. Second, over-confidence is more severe as [math]\displaystyle{ \kappa }[/math] increases, suggesting that the conclusion of the theory holds more broadly than its assumptions. Third, [math]\displaystyle{ \sigma_{underconf} }[/math] leads to under-confidence for [math]\displaystyle{ p \in (0.5, 0.51) }[/math], which verifies Theorem 2 and Corollary 3. Finally, the theoretical prediction closely matches the simulation, further confirming the theory.

The generality of the theory beyond the Gaussian input assumption and the binary classification setting was further tested on the CIFAR10 dataset by running multi-class logistic regression on its first five classes. The authors performed logistic regression on two kinds of labels: the true labels, and pseudo-labels generated from the fitted multi-class logistic (softmax) model.
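Generating pseudo-labels for this kind of check can be sketched as follows (our own reconstruction of the described procedure, reusing the imports and rng from the previous sketch; X5 and y5 denote the hypothetical features and labels of the first five CIFAR10 classes):

<pre>
# Fit a softmax model on the true labels of the five classes.
softmax_clf = LogisticRegression(max_iter=1000).fit(X5, y5)

# Pseudo-labels: sample each label from the fitted model's predicted
# class probabilities, so the pseudo-labels are realizable by the model.
probs = softmax_clf.predict_proba(X5)
pseudo_y = np.array([rng.choice(probs.shape[1], p=p_row) for p_row in probs])

# Refit on the pseudo-labels and compare calibration curves for both fits.
pseudo_clf = LogisticRegression(max_iter=1000).fit(X5, pseudo_y)
</pre>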

The figure above indicates that the logistic regression is over-confident on both labels, where the over-confidence is more severe on the pseudo-labels than the true labels. This suggests the result that logistic regression is inherently over-confident may hold more broadly for other under-parametrized problems without strong assumptions on the input distribution, or even when the labels are not necessarily realizable by the model.

Conclusion

1. The well-specified logistic regression is inherently over-confident:

Conditioned on the model predicting [math]\displaystyle{ p \gt 0.5 }[/math], the actual probability of the label being one is lower by an amount of [math]\displaystyle{ \Theta(d/n) }[/math], in the limit where [math]\displaystyle{ n, d \to \infty }[/math] proportionally and [math]\displaystyle{ n/d }[/math] is large. In other words, the calibration error is always in the over-confident direction. Moreover, the overall Calibration Error (CE) of the logistic model is [math]\displaystyle{ \Theta(d/n) }[/math] in this limiting regime.
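For context, a standard definition of calibration error (the paper's exact formulation may differ in details such as the choice of norm) measures the average gap between the predicted confidence [math]\displaystyle{ \hat{p} }[/math] and the true conditional probability of the label:

[math]\displaystyle{ \mathrm{CE} = \mathbb{E}_{\hat{p}}\big[\,\big|\,\mathbb{P}(y = 1 \mid \hat{p}) - \hat{p}\,\big|\,\big] }[/math]

Over-confidence at a level [math]\displaystyle{ p \gt 0.5 }[/math] then means [math]\displaystyle{ \mathbb{P}(y = 1 \mid \hat{p} = p) \lt p }[/math].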

2. The authors identify sufficient conditions for over- and under-confidence in general binary classification problems, where the data is generated from an arbitrary nonlinear activation and a well-specified empirical risk minimization (ERM) problem is solved with a suitable loss function. Their conditions imply that any symmetric, monotone activation [math]\displaystyle{ \sigma: \mathbb{R} \to [0,1] }[/math] that is concave at all [math]\displaystyle{ z \gt 0 }[/math] will yield a classifier that is over-confident at any confidence level.

3. Another perhaps surprising implication is that over-confidence is not universal:

They prove that there exists an activation function for which under-confidence can happen for a certain range of confidence levels.

Critiques

This paper provides a precise theoretical study of the calibration error of logistic regression and a class of general binary classification problems. The authors show that logistic regression is inherently over-confident by [math]\displaystyle{ \Theta(d/n) }[/math] when [math]\displaystyle{ n/d }[/math] is large, and establish sufficient conditions for the over- or under-confidence of unregularized ERM for general binary classification. Their results reveal that

(1) Over-confidence is not just a result of over-parametrization;

(2) Over-confidence is a common mode but not universal.

Their work opens up a number of future questions, such as the interplay between calibration and model training (or regularization), or theoretical studies of calibration on nonlinear models.

References