stat946w18/MaskRNN: Instance Level Video Object Segmentation
Introduction
Deep learning has produced state-of-the-art results in many computer vision tasks, including image classification, object localization, object detection, object segmentation, semantic segmentation and instance-level video object segmentation. Image classification assigns a label to an image based on its prominent objects. Object localization is the task of finding the objects' locations in the frame. Object segmentation provides a pixel map that represents the pixel-wise location of the objects in the image. Semantic segmentation attempts to segment the image into meaningful parts. Instance-level video object segmentation is the task of segmenting each object consistently across the frames of a video sequence.
There are two types of video object segmentation: unsupervised and semi-supervised. In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. In the semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame, so the task is simplified to tracking only the required objects. In this paper we look at a semi-supervised video object segmentation technique.
Background Papers
Video object segmentation has been performed using spatio-temporal graphs and deep learning. The graph-based methods construct 3D spatio-temporal graphs in order to model the inter- and intra-frame relationships of pixels or superpixels in a video. As a result, they are computationally slower than deep learning methods and are unable to run in real time. There are two main deep learning techniques for semi-supervised video object segmentation: One-Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). The following is a brief description of the new techniques these papers introduce for the semi-supervised video object segmentation task.
OSVOS (One-Shot Video Object Segmentation)
This paper introduces frame-by-frame object segmentation that uses no temporal information from the previous frames of the video. The paper uses a VGG-16 network with weights pre-trained on an image classification task. This network is then converted into a fully convolutional network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.
During testing, the trained VGG-16 FCN is fine-tuned on the first frame of the video and its ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming, rotating and flipping the first frame together with the associated segment mask.
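The first-frame augmentation can be illustrated with a minimal NumPy sketch. It is not the OSVOS implementation: the function name `augment_pair` is hypothetical, only flips and 90-degree rotations are shown (zooming is omitted), and the point is simply that the frame and its mask must receive the same transform.

```python
import numpy as np

def augment_pair(frame, mask, rng):
    """Apply one random flip and one random 90-degree rotation to both inputs.

    frame: (H, W, 3) array, mask: (H, W) array. The same transform is applied
    to both so that the mask stays aligned with the frame.
    """
    if rng.random() < 0.5:
        # Horizontal flip: reverse the column axis of frame and mask.
        frame, mask = frame[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))  # number of 90-degree rotations (0 to 3)
    frame = np.rot90(frame, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return frame, mask

rng = np.random.default_rng(0)
first_mask = np.zeros((4, 6), dtype=np.uint8)
first_mask[1:3, 2:5] = 1
# Toy frame whose first channel copies the mask, so alignment is easy to check.
first_frame = np.stack(
    [first_mask, first_mask * 0, first_mask * 0], axis=-1
).astype(np.float32)

augmented = [augment_pair(first_frame, first_mask, rng) for _ in range(8)]
```

Each augmented pair can then serve as one extra training example during fine-tuning.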
MaskTrack (Learning Video Object Segmentation from Static Images)
MaskTrack feeds the output for the previous frame back into the network to improve the segmentation mask it predicts for the next frame. The input to the network therefore has 4 channels: 3 RGB channels from the frame at time (t) plus 1 binary segmentation mask from frame (t-1). The output of the network is the binary segmentation mask for the frame at time (t). By using the binary segmentation mask as guidance (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.
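The 4-channel input described above can be sketched in NumPy as follows. The helper name `build_masktrack_input` and the channels-last layout are assumptions made for illustration, not details taken from the MaskTrack paper.

```python
import numpy as np

def build_masktrack_input(frame_rgb, prev_mask):
    """Stack an RGB frame (H, W, 3) and the previous mask (H, W) into (H, W, 4)."""
    if frame_rgb.shape[:2] != prev_mask.shape:
        raise ValueError("frame and mask must share height and width")
    # Cast the mask to the frame dtype and add a trailing channel axis.
    mask_channel = prev_mask.astype(frame_rgb.dtype)[..., np.newaxis]
    return np.concatenate([frame_rgb, mask_channel], axis=-1)

frame_t = np.random.rand(4, 6, 3).astype(np.float32)  # frame at time t
mask_prev = np.zeros((4, 6), dtype=np.uint8)           # mask from time t-1
mask_prev[1:3, 2:5] = 1
net_input = build_masktrack_input(frame_t, mask_prev)
```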
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.
A parallel ConvNet is used to generate a predicted segment mask from the optical flow magnitude. The optical flow between two frames is calculated using the EpicFlow algorithm. The outputs of the two networks are combined by averaging to generate the final predicted segmentation mask.
Table 1 gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.
Dataset
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and SegTrack v2. The DAVIS-2016 dataset provides video sequences with only one segment mask covering all salient objects. DAVIS-2017 improves the ground truth data by providing a segmentation mask for each salient object as a separately colored segment. SegTrack v2 also provides multiple segmentation masks, one for each salient object in the video sequence. These datasets try to recreate real-life scenarios such as occlusions, low-resolution videos, background clutter, motion blur and fast motion.
MaskRNN: Introduction
Most of the techniques mentioned above do not work directly on instance-level segmentation of objects through the video sequence. They focus on segmenting each frame and use additional information from the preceding frame (mask propagation and optical flow) to make predictions for the current frame. To address the instance-level segmentation problem, MaskRNN proposes a framework in which the salient objects are tracked and segmented by capturing the temporal information in the video sequence with a recurrent neural network.
MaskRNN: Overview
The input to the network is a video sequence [math]\displaystyle{ I = \{I_1, I_2, \ldots, I_T\} }[/math] of T frames that contains N salient objects. The ground truth for the first frame [math]\displaystyle{ y_1^* }[/math] is also provided for the N salient objects. In this paper, the problem is formulated as a time-dependency problem: using a recurrent neural network, the prediction for the previous frame influences the prediction for the next frame. The approach also computes the optical flow between frames and uses it as an input to the neural network. (Optical flow is the apparent motion of objects between two consecutive frames, expressed as a 2D vector field that represents the displacement of the brightness pattern at each pixel. It is "apparent" because it depends on the relative motion between the observer and the scene.) The optical flow is also used to align the predicted mask from the previous frame with the current frame. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for N deep nets, one for each of the N objects.”[1 - MaskRNN]
Each deep net is made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to remove outliers from the predictions. The final segmentation mask for an object is generated by merging the predictions of the two networks. For N objects, there are N deep nets, each of which predicts the mask for one salient object. These predictions are then merged into a single prediction using an argmax operation at test time.
MaskRNN: Multiple Instance Level Segmentation
Image segmentation requires producing a pixel-level segmentation mask, and this can become a multi-class problem. Instead, following the approach from [2- Mask R-CNN], the problem is converted into multiple binary segmentation problems. A separate segmentation mask is predicted for each salient object, so each prediction is a binary segmentation problem. The binary segments are combined using an argmax operation in which each pixel is assigned to the object with the largest predicted probability.
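The argmax merge can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `merge_instance_masks`, the 1-indexed object labels, and the 0.5 background threshold are all assumptions (the summary above does not say how background pixels are decided).

```python
import numpy as np

def merge_instance_masks(probs, threshold=0.5):
    """Merge per-object foreground probabilities into one label map.

    probs: (N, H, W) array, probs[k] is the foreground probability of object k.
    Returns an (H, W) integer map: 0 is background, k + 1 is object k.
    The background threshold is an assumption made for this sketch.
    """
    best_object = probs.argmax(axis=0)  # index of the most probable object
    best_prob = probs.max(axis=0)       # its probability
    return np.where(best_prob > threshold, best_object + 1, 0)

probs = np.array([
    [[0.9, 0.2], [0.6, 0.1]],  # object 0
    [[0.3, 0.8], [0.7, 0.4]],  # object 1
])
labels = merge_instance_masks(probs)
# labels is [[1, 2], [2, 0]]: the last pixel stays background because
# neither object exceeds the threshold there.
```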
MaskRNN: Binary Segmentation Network
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The deep net consists of two networks: a binary segmentation network and an object localization network. The binary segmentation network is split into two streams: the appearance stream and the flow stream. The input of the appearance stream is the RGB frame at time t and the warped prediction of the binary segmentation mask from time (t-1). The warping function uses the optical flow between frame (t-1) and frame (t) to generate a new binary segmentation mask for frame (t). The input to the flow stream is the concatenation of the optical flow magnitude between frames (t-1) and (t), the optical flow magnitude between frames (t) and (t+1), and the warped prediction of the segmentation mask from frame (t-1). The magnitude of the optical flow is replicated into an RGB format before it is fed to the flow stream. The network architecture closely resembles a VGG-16 network without the fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The outputs of the flow stream and the appearance stream are linearly combined, and a sigmoid function is applied to the result to generate the binary mask for the i-th object. All parts of the network are fully differentiable, so the network can be trained end-to-end.
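Three of the operations described above (warping the previous mask with optical flow, replicating the flow magnitude into three channels, and fusing the two streams with a sigmoid) can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the nearest-neighbour sampling, the backward-flow convention, and the scalar fusion weights `w_app`, `w_flow` and `bias` are all hypothetical.

```python
import numpy as np

def warp_mask(prev_mask, backward_flow):
    """Nearest-neighbour warp of the mask from frame t-1 onto frame t.

    Assumed convention: backward_flow[y, x] = (dx, dy) points from pixel
    (x, y) in frame t to its source location in frame t-1.
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]).astype(int), 0, h - 1)
    return prev_mask[src_y, src_x]

def flow_magnitude_rgb(flow):
    """Replicate the per-pixel flow magnitude into three identical channels."""
    magnitude = np.linalg.norm(flow, axis=-1)
    return np.repeat(magnitude[..., np.newaxis], 3, axis=-1)

def fuse_streams(appearance_logits, flow_logits, w_app=1.0, w_flow=1.0, bias=0.0):
    """Linearly combine the two stream outputs and apply a sigmoid."""
    z = w_app * appearance_logits + w_flow * flow_logits + bias
    return 1.0 / (1.0 + np.exp(-z))

# Toy example: a single foreground pixel that moves one pixel to the right.
prev_mask = np.zeros((3, 4), dtype=np.uint8)
prev_mask[1, 1] = 1
backward_flow = np.zeros((3, 4, 2))
backward_flow[..., 0] = -1.0  # every pixel in frame t came from x - 1

warped = warp_mask(prev_mask, backward_flow)
flow_rgb = flow_magnitude_rgb(backward_flow)
probability = fuse_streams(np.zeros((3, 4)), np.zeros((3, 4)))
```

In a real network the fusion weights would be learned, and a differentiable bilinear warp would replace the nearest-neighbour lookup so that gradients can flow through the warped mask.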
MaskRNN: Object Localization Network
The object localization network generates a bounding box for the salient object in the frame. It uses a technique similar to the Fast R-CNN method of object localization: region of interest (RoI) pooling is performed on the features of the region proposals (here, the bounding box proposals), and the pooled features are passed through fully connected layers to perform regression. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. This bounding box is enlarged by a factor of 1.25 and combined with the output of the binary segmentation network. Only the part of the segment mask inside the bounding box is used for the prediction, and the pixels outside the bounding box are set to zero. In other words, a pixel is classified as foreground only if it is both predicted to be foreground by the binary segmentation net and located within the enlarged bounding box estimated by the object localization net.
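The combination of the enlarged bounding box with the segmentation mask can be sketched as follows. The helper names and the box convention `(x0, y0, x1, y1)` with exclusive upper bounds are assumptions, as is the choice to enlarge the box about its center; the summary above only states the factor of 1.25.

```python
import numpy as np

def enlarge_box(box, height, width, factor=1.25):
    """Scale a box (x0, y0, x1, y1) about its center and clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = factor * (x1 - x0) / 2.0
    half_h = factor * (y1 - y0) / 2.0
    new_x0 = max(0, int(np.floor(cx - half_w)))
    new_y0 = max(0, int(np.floor(cy - half_h)))
    new_x1 = min(width, int(np.ceil(cx + half_w)))
    new_y1 = min(height, int(np.ceil(cy + half_h)))
    return new_x0, new_y0, new_x1, new_y1

def apply_box_prior(seg_mask, box, factor=1.25):
    """Keep foreground pixels only where they fall inside the enlarged box."""
    h, w = seg_mask.shape
    x0, y0, x1, y1 = enlarge_box(box, h, w, factor)
    inside = np.zeros((h, w), dtype=bool)
    inside[y0:y1, x0:x1] = True
    # Foreground requires both conditions: segmentation AND inside the box.
    return np.logical_and(seg_mask.astype(bool), inside)

seg_mask = np.ones((10, 10), dtype=np.uint8)  # segmentation net says all foreground
enlarged = enlarge_box((2, 2, 6, 6), height=10, width=10)
final_mask = apply_box_prior(seg_mask, box=(2, 2, 6, 6))
```

Here the 4x4 box grows to 6x6 after enlargement and rounding, and every pixel outside it is zeroed even though the segmentation net marked it as foreground.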
Training and Finetuning
The network depicted in Figure 1 is trained with backpropagation through time, which preserves the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the standard procedure for the semi-supervised video object segmentation setting: the network is fine-tuned using the ground truth segmentation mask of the first frame. In this way, the network is further optimized for the given video sequence.
MaskRNN: Implementation Details
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame (t-1). Two different networks are trained offline, one for DAVIS-2016 and one for DAVIS-2017, for a fair evaluation on both datasets. After both the object localization net and the binary segmentation network have been trained, the recurrent connection is added so that temporal information can further improve the segmentation results. Because of GPU memory constraints, gradients are backpropagated through only 7 frames when the RNN learns long-term temporal information.
For optical flow, a pre-trained FlowNet 2.0 is used to compute the optical flow between frames.
The deep nets (without the RNN) are then fine-tuned at test time by online training on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to [math]\displaystyle{ 10^{-5} }[/math] for online training, which runs for 200 iterations.
MaskRNN: Experimental Results
Evaluation Metrics
Three metrics are used to evaluate the performance of video object segmentation techniques:
1. Region Similarity (Jaccard Index): Region similarity, or intersection-over-union, captures how precisely the area covered by the predicted segmentation mask matches the ground truth segmentation mask.
2. Contour Accuracy (F-score): This metric measures how accurately the boundary of the predicted segment mask matches the boundary of the ground truth segment mask, using bipartite matching between the boundary pixels of the two masks.
3. Temporal Stability: This estimates the degree of deformation needed to transform the segmentation mask from one frame to the next, and is measured by the dissimilarity of the sets of points on the contours of the segmentation in two adjacent frames.
Region Similarity measures how well the pixels of the two masks match, while Contour Accuracy measures the accuracy of the contours.
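As an illustration of the first metric, the Jaccard index of two binary masks can be computed with a short NumPy sketch. The function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than something stated in the source.

```python
import numpy as np

def jaccard_index(pred_mask, gt_mask):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treated as a perfect match (assumed convention).
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])
score = jaccard_index(pred, gt)  # 2 shared pixels out of 4 in the union
```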
Ablation Study
The ablation study summarizes how the different components contribute to the algorithm, evaluated on the DAVIS-2016 and DAVIS-2017 datasets.
The above table presents the contribution of each component of the network to the final prediction score. We observe that online fine-tuning improves the performance by a large margin. Adding the RNN, the localization net and the flow stream (FStream) each appears to improve the performance of the deep net.
Quantitative Evaluation
The authors use DAVIS-2016, DAVIS-2017 and SegTrack v2 to compare the performance of the proposed approach with other methods on foreground-background video object segmentation and on multiple instance-level video object segmentation.
The above table shows the results for mean contour accuracy and region similarity. The MaskRNN method appears to outperform all previously proposed methods. The authors attribute the performance gain to employing a recurrent neural network to learn the recurrence relationship and to using an object localization network to improve the prediction results.
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.
Qualitative Evaluation
The authors show example qualitative results from the DAVIS and SegTrack datasets.
Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).
Conclusion
This paper presents a novel approach to the instance-level video object segmentation task that performs better than the previous state of the art on the evaluated datasets. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve the accuracy of the system. Online fine-tuning adjusts the network so that it predicts better for the current video sequence.