Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition


Introduction

Facial expression is one of the most natural ways that human beings express emotion. The Facial Action Coding System (FACS) attempts to systematically categorize facial expressions by specifying a basic set of muscle contractions or relaxations, formally called Action Units (AUs). For example, "AU 1" stands for the inner portion of the brows being raised, and "AU 6" stands for the cheeks being raised. Such a framework allows any facial expression to be described as a combination of AUs. However, over the course of an average day, most people do not experience drastically varying emotions, so their facial expressions may change only subtly. In addition, manual AU coding involves a great deal of subjectivity. To address these issues, it is desirable to automate this task. Automated AU recognition also has potential applications in human-computer interaction (HCI), online education, and interactive gaming, among other domains.

Because of recent advances in object detection and categorization, CNNs are an appealing choice for the facial AU recognition task described above. However, compared with those tasks, the training sets available for AU recognition are small, so the learned CNNs suffer from overfitting. This work builds on the idea of integrating boosting within a CNN. Boosting is a technique wherein multiple weak learners are combined to yield a strong learner. Moreover, this work modifies how the CNN is trained over mini-batches of a large dataset. In a typical mini-batch strategy, each iteration uses its batch only to update the parameters, and the features selected in that iteration are discarded for subsequent iterations. Here the authors incorporate incremental learning by building a boosted classifier that accumulates what was learned across all iterations/batches, hence the name Incremental Boosting Convolutional Neural Network (IB-CNN). The IB-CNN introduced here outperforms state-of-the-art CNNs on four standard AU-coded databases.

Related Work

Methodology

We begin with a quick review of CNNs and the boosting CNN (B-CNN), and then develop the details of the model presented in this paper.

CNNs

Convolutional Neural Networks are neural networks that contain at least one convolution layer, typically accompanied by a pooling layer. A convolution layer, as the name suggests, convolves the input matrix (or tensor) with a filter (or kernel). This operation produces a feature map (or activation map) that, roughly speaking, indicates the presence or absence of the filter's pattern across the image. Note that the parameters of the kernel are not preset but learned. A pooling layer reduces the dimensionality of the data by summarizing the activation map in some way (such as taking the max or the average). Following this, the feature maps are vectorized and fed into a fully-connected (FC) layer, whose neurons have connections to all the activations in the previous layer. Finally, there is a decision layer with as many neurons as there are classes. The decision layer assigns the class by computing a score function over the activations of the FC layer. Generally, an inner-product score function is used; in this work it is replaced by a boosting score function.
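
As a rough illustration of this pipeline (not the authors' architecture; the class name, layer sizes, and input shape below are placeholder assumptions), a minimal PyTorch sketch might look like:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal convolution -> pooling -> FC -> decision-layer network (illustrative only)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=5)   # kernel parameters are learned, not preset
        self.pool = nn.MaxPool2d(2)                  # pooling summarizes the activation map
        self.fc = nn.Linear(8 * 30 * 30, 64)         # fully-connected layer: activation features
        self.decision = nn.Linear(64, num_classes)   # inner-product score function over FC activations

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))      # convolution + pooling
        x = x.flatten(1)                             # vectorize the feature maps
        x = torch.relu(self.fc(x))                   # FC activations
        return self.decision(x)                      # one score per class

# Example: a batch of four 64x64 grayscale face crops
scores = SimpleCNN()(torch.randn(4, 1, 64, 64))      # shape (4, num_classes)
```

In the IB-CNN setting, it is the final inner-product decision layer of a network like this that is replaced by the boosting score described next.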

Boosting CNN

Let $X = [x_1, \dots, x_M]$ denote the activation features of a training data batch of size $M$, where each $x_i$ has $K$ activation features. Also, let $Y = [y_1, \dots , y_M]$ denote the corresponding labels, i.e. each $y_i \in \{-1, +1\}$. In other words, the vector $x_i$ contains all the activations of the FC layer, which would ordinarily be multiplied by the weights connecting the FC layer to the decision layer. However, as mentioned earlier, this work performs the scoring via boosting rather than via that inner product. Denoting the weak classifiers by $h(\cdot)$, the strong classifier is
\begin{equation}
H(x_i) = \sum\limits_{j = 1}^K \alpha_j h(x_{ij}; \lambda_j)
\end{equation}
where $x_{ij}$ is the $j^{th}$ activation feature of the $i^{th}$ input and $\alpha_j$ is the weight of the $j^{th}$ weak classifier. Here, $h(\cdot)$ is defined as a differentiable surrogate for the sign function:
\begin{equation}
h(x_{ij}; \lambda_j) = \dfrac {f(x_{ij}; \lambda_j)}{\sqrt{f(x_{ij}; \lambda_j)^2 + \eta^2}}
\end{equation}
where $f(x_{ij}; \lambda_j)$ denotes a decision stump (one-level decision tree) with threshold $\lambda_j$, and $\eta$ is a small constant that smooths the sign function.
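
As a small sketch of how this boosting score could be computed on the FC activations, assuming for concreteness that the decision stump takes the simple form $f(x_{ij}; \lambda_j) = x_{ij} - \lambda_j$ (the paper's stump may be defined differently) and that $\eta$ is a fixed smoothing constant:

```python
import torch

def soft_stump(x, lam, eta=0.1):
    """Differentiable weak classifier h(x; lambda): a smoothed sign of the stump output.
    Assumption: the decision stump is f(x; lambda) = x - lambda."""
    f = x - lam
    return f / torch.sqrt(f ** 2 + eta ** 2)

def boosting_score(X, alpha, lam, eta=0.1):
    """Strong classifier H(x_i) = sum_j alpha_j * h(x_ij; lambda_j).
    X: (M, K) FC activation features; alpha, lam: (K,) weights and thresholds."""
    return soft_stump(X, lam, eta) @ alpha   # one score per input, shape (M,)

# Example: M = 3 inputs, K = 5 activation features
X = torch.randn(3, 5)
alpha = torch.full((5,), 1.0 / 5)   # weak-classifier weights (learned in practice)
lam = torch.zeros(5)                # stump thresholds (learned in practice)
print(boosting_score(X, alpha, lam))
```

Because every operation above is differentiable, the score can be trained end-to-end with the rest of the network by backpropagation.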

CNNs

Experiments

Conclusion

References

1. Han, S., Meng, Z., Khan, A. S., Tong, Y. (2016). "Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition". NIPS 2016.
2. Tian, Y., Kanade, T., Cohn, J. F. (2001). "Recognizing Action Units for Facial Expression Analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 2.