DeepVO Towards end to end visual odometry with deep RNN
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization”, with notable applications on the Mars Exploration Rovers and Autonomous Vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on the topic of monocular visual odometry. Particularly, the authors examine prominent VO methods and argue mainstream geometry based monocular VO methods should be amended with deep learning approaches. Subsequently, the paper proposes a novel deep-learning based end-to-end VO algorithm, and then empirically demonstrates its viability.
Visual odometry algorithms can be grouped into two main categories. The first is known as the conventional methods, and they are based on established principles of geometry. Specifically, an object’s position and orientation are obtained by identifying reference points and calculating how those points change over the image sequence. Moreover, algorithms in this category can be divided into two sub-categories, which differ by how they select reference points. Namely, sparse feature based methods establish reference points using image salient features, such as corners and edges . Whereas, direct methods make use of the whole image and consider every pixel as a reference point . Furthermore, semi-direct methods that combine both approaches are recently gaining popularity .
Today, most of state-of-the-art VO algorithms belong to the geometry family. However, they have significant limitations. For example, direct methods assume “photometric consistency” . Whereas, sparse feature methods are prone to “drifting” because of outliers and noises. As the result, the paper argues geometry-based methods are difficult to engineer and calibrate, thus limiting its practicality. Figure 1 illustrates the general architecture of geometry-based algorithms, and it outlines necessary drift correction techniques such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local optimization (bundle adjustment)