joint training of a convolutional network and a graphical model for human pose estimation
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.
Convolutional Network Part-Detector
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.
First, a Laplacian Pyramid<ref> "Pyramid (image processing)" </ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref> Collobert R, Kavukcuoglu K, Farabet C.Torch7: A matlab-like environment for machine learning BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376). </ref>) is applied to those input images. For each resolution bank, sliding-window ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.
The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of high resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to fully-connected network model but more efficient.
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.