# joint training of a convolutional network and a graphical model for human pose estimation

## Introduction

Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.

This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.

## Model

### Convolutional Network Part-Detector

They combine an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.

Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref> "Pyramid (image processing)" </ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref> Collobert R, Kavukcuoglu K, Farabet C.Torch7: A matlab-like environment for machine learning BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376). </ref>) is applied to those input images. However, in this model, only a full image stage and a half-resolution stage was used, allowing for a simpler architecture and faster training.

Although, a sliding window architecture is usually used for this type of task, it has the down side of creating redundant convolutions. Instead, in this network, for each resolution bank, ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.

The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of high resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to fully-connected network model but more efficient.

Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.

### Higher-Level Spatial-Model

They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.

They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:

$\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )$

Where $v$ is the joint location, $p_{A|v}$ is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint $v$ is located at the center pixel, $b_{v\rightarrow A}$ is a bias term used to describe the background probability for the message from joint $v$ to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should to be removed from the graph structure.

For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is

$\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )$

$\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2$

$\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0\lt \epsilon \leq 0.01$

With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.

The convolution kernels they use in this step is quite large, thus they apply FFT convolutions based on the GPU, which is introduced by Mathieu et al.<ref> Mathieu M, Henaff M, LeCun Y.Fast training of convolutional networks through ffts arXiv preprint arXiv:1312.5851, 2013. </ref>.The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.

### Unified Model

They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance. Because the SpatialModel is able to effectively reduce the output dimension of possible heat-map activations, the PartDetector can use available learning capacity to better localize the precise target activation.

## Results

They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref> "FLIC-plus Dataset" </ref> which is fairer than FLIC-full dataset.

Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.

Performance on the LSP dataset is shown here.

Since the LSP dataset cover a larger range of the possible poses, their Spatial-Model is less effective. The accuracy for this dataset is lower than FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.

<references />