Learning Long-Range Vision for Autonomous Off-Road Driving


Stereo vision has been used extensively on mobile robots to identify near-to-far obstacles in their path, but it is limited by its maximum range of 12 meters. For high-speed mobile robots, recognizing obstacles at longer ranges is vital for safety.

The authors of this paper proposed a "long-range vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref> to robustly increase the obstacle and path detection range to over 100 meters. This approach has been implemented and tested on the Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC).

Related Work

A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Larry Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref>. There have been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color-based improvements can easily be fooled by shadows, monochromatic terrain and complex obstacles or ground types.

More recent vision-based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref> use learning algorithms to map traversability information to color histograms or geometric (point cloud) data, and have achieved success in the DARPA challenge.

Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref> Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075. </ref>, and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation.<ref> Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation in vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004. </ref>


Learning long-range vision in this way poses three challenges:

  • Choice of feature representation: the features must be informative enough for classification yet robust to irrelevant transformations of the input.
  • Automatic generation of training labels: because the classifier is trained in real time, it requires a constant stream of training data and labels to learn from.
  • Ability to generalize from near to far field: the apparent size of an object is inversely proportional to its distance from the camera, so the system needs to take this into account and normalize the objects detected.

Overview of the Learning Process

Learning System Proposed by <ref name="hadsell2009" />

The learning process described by <ref name="hadsell2009" /> is as follows:

  1. Pre-Processing and Normalization: This step corrects the skewed horizon captured by the camera and normalizes the scale of objects in the image, since an object's apparent size is inversely proportional to its distance from the camera.
  2. Feature Extraction: Convolutional Neural Networks are trained and used to extract features, which reduces the dimensionality of the input.
  3. Stereo Supervisor Module: An elaborate procedure that uses multiple ground plane estimation, heuristics and statistical false-obstacle filtering to assign class labels to close-range objects in the normalized input. The goal is to generate training data for the classifier at the end of this learning process.
  4. Training and Classification: The class labels are combined with the extracted features and fed into the classifier for real-time training. The classifier is trained on every frame; the authors use stochastic gradient descent to update the classifier weights and cross entropy as the loss function.

Pre-Processing and Normalization

The first stage of the learning process addresses two issues. The first is the skewed horizon caused by the roll of the camera and the terrain. The second is the scale of objects in the input image: because apparent size is inversely proportional to distance from the camera, objects need to be normalized to represent their true scale.

File:horizon pyramid.png
Horizon Pyramid <ref name="hadsell2009" />

To solve both issues, a normalized “pyramid” containing 7 sub-images is extracted (see figure above), where the top row of the pyramid covers a range from 112 meters to infinity and the closest row covers a range of 4 to 11 meters. These pyramid sub-images are extracted and normalized from the input image to form the input for the next stage.
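
The inverse relation between apparent size and distance follows from the pinhole camera model, in which an object of height H at distance Z projects to roughly f·H/Z pixels. The sketch below only illustrates why distant pyramid rows must be magnified more than near ones; the focal length, object height and target height are hypothetical values, not figures from the paper.

```python
def apparent_height_px(height_m, distance_m, focal_px=500.0):
    """Projected height in pixels under a pinhole model.

    focal_px is a made-up focal length used only for illustration.
    """
    return focal_px * height_m / distance_m


def normalizing_scale(distance_m, height_m=1.0, target_px=20.0, focal_px=500.0):
    """Resize factor that brings an object at distance_m to target_px pixels."""
    return target_px / apparent_height_px(height_m, distance_m, focal_px)


if __name__ == "__main__":
    # Doubling the distance halves the apparent height, so the
    # normalizing scale factor doubles.
    for z in (5.0, 10.0, 50.0, 100.0):
        print(z, apparent_height_px(1.0, z), normalizing_scale(z))
```

Under these assumed values, a far band needs a scale factor above 1 (magnification) while a near band needs a factor below 1 (reduction), which is the purpose of the distance-normalized pyramid.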

File:horizon normalize.png
Creating target sub-image <ref name="hadsell2009" />

To obtain the scaled and horizon-corrected sub-images, the authors used a combination of a Hough transform and a robust PCA refit to estimate the ground plane [math]P = (p_{r}, p_{c}, p_{d}, p_{o})[/math], where [math]p_{r}[/math], [math]p_{c}[/math] and [math]p_{d}[/math] are the coefficients of the image row, column and disparity, and [math]p_{o}[/math] is the offset. Once the ground plane [math]P[/math] is estimated, the horizon target sub-image [math]A, B, C, D[/math] (see figure above) is computed from the line [math]\overline{EF}[/math] on the plane at a stereo disparity of [math]d[/math] pixels, with [math]w[/math] denoting the width of the input image. The following equations give the center [math]M[/math] of the line, its end points [math]E[/math] and [math]F[/math], the rotation [math]\theta[/math] and finally the corner points [math]A, B, C, D[/math].

[math]\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p_{o}}{-p_{r}}[/math]

[math]E = ( \textbf{M}_{x} - \textbf{M}_{x} \cos{\theta}, \textbf{M}_{y} - \textbf{M}_{y} \sin{\theta} )[/math]

[math]F = ( \textbf{M}_{x} + \textbf{M}_{x} \cos{\theta}, \textbf{M}_{y} + \textbf{M}_{y} \sin{\theta} )[/math]

[math]\theta = \tan^{-1} \left( \frac{1}{w} \left( \frac{w p_{c} + p_{d} d + p_{o}}{-p_{r}} - \frac{p_{d} d + p_{o}}{-p_{r}} \right) \right)[/math]

[math]A = ( \textbf{E}_{x} + \alpha \sin \theta, \textbf{E}_{y} - \alpha \cos \theta )[/math]

[math]B = ( \textbf{F}_{x} + \alpha \sin \theta, \textbf{F}_{y} - \alpha \cos \theta )[/math]

[math]C = ( \textbf{F}_{x} - \alpha \sin \theta, \textbf{F}_{y} + \alpha \cos \theta )[/math]

[math]D = ( \textbf{E}_{x} - \alpha \sin \theta, \textbf{E}_{y} + \alpha \cos \theta )[/math]
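
As a rough illustration of the geometry above, the sketch below computes the center of the line and its rotation from a ground plane, and then the four corners from given end points. It rests on several assumptions that are not stated in this summary: that the plane satisfies p_r·row + p_c·column + p_d·d + p_o = 0, that M_x is the horizontal image center w/2, and that θ is the arctangent of the slope of the line across the image width. The function names are hypothetical.

```python
import math


def horizon_line(plane, d, w):
    """Return the centre M and rotation theta of line EF at disparity d.

    plane = (p_r, p_c, p_d, p_o); w is the image width in pixels.
    Assumption: M_x is the horizontal image centre, w / 2.
    """
    p_r, p_c, p_d, p_o = plane

    def row_at(col):
        # Row where the ground plane has disparity d at the given column.
        return (p_c * col + p_d * d + p_o) / (-p_r)

    m_x = w / 2.0
    m_y = row_at(m_x)
    # Assumption: theta is the angle of the line's slope across the image.
    theta = math.atan((row_at(w) - row_at(0.0)) / w)
    return (m_x, m_y), theta


def sub_image_corners(e, f, alpha, theta):
    """Corners A, B, C, D, offset by alpha perpendicular to the line EF."""
    dx, dy = alpha * math.sin(theta), alpha * math.cos(theta)
    corner_a = (e[0] + dx, e[1] - dy)
    corner_b = (f[0] + dx, f[1] - dy)
    corner_c = (f[0] - dx, f[1] + dy)
    corner_d = (e[0] - dx, e[1] + dy)
    return corner_a, corner_b, corner_c, corner_d
```

With a plane that has no column dependence (p_c = 0) the line is horizontal, θ is zero, and the corners form an axis-aligned rectangle of height 2α around the line.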

In the last step of this stage, the images are converted from RGB to the YUV color space, a common step in image processing pipelines.
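
The summary does not say which YUV variant is used, so the following is only a minimal sketch that assumes the common BT.601 luma weights and analog U/V scaling; `rgb_to_yuv` is a hypothetical helper name.

```python
import numpy as np


def rgb_to_yuv(rgb):
    """Convert an (..., 3) float RGB array in [0, 1] to YUV.

    Assumption: BT.601 luma weights with the classic U/V scale factors.
    """
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return np.stack([y, u, v], axis=-1)
```

Separating luminance from chroma in this way lets later stages treat brightness and color independently; gray pixels map to zero U and V.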

Feature Extraction

The goal of feature extraction is to reduce the input dimensionality and increase the generality of the classifier that is trained on the result. Instead of using a hand-tuned feature list, <ref name="hadsell2009" /> used a data-driven approach and trained 4 different feature extractors. This is the only component of the learning process that is trained off-line.

  • Radial Basis Functions (RBF): A set of 100 RBF centers was learned, and a feature vector is formed from the Euclidean distance between the input window and each center. Each component of the feature vector [math]D[/math] has the form:

    [math]D_{i} = \exp(-\beta^{i} || X - K^{i} ||^{2}_{2})[/math]

    where [math]\beta^{i}[/math] is the inverse variance of the RBF center [math]K^{i}[/math], [math]X[/math] is the input window, and [math]K = \{K^{i} | i = 1 \dots n\}[/math] is the set of [math]n[/math] radial basis centers.

  • Convolutional Neural Network (CNN): A standard CNN was used. The architecture consists of two layers: the first has 20 7x6 filters and the second has 369 6x5 filters. During training, a fully connected layer of 100 hidden neurons feeding 5 outputs is placed on top of the network. Once the network is trained, the final output layer is removed, so the resulting CNN outputs a 100-component feature vector. For training, the authors randomly initialized the weights and used stochastic gradient descent for 30 epochs with [math]L^2[/math] regularization. The network was trained on 450,000 labeled image patches and tested on 50,000 labeled patches.

  • Supervised and Unsupervised Auto-Encoders: Auto-Encoders or Deep Belief Networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref> are trained with a layer-wise procedure. The deep belief net trained here has 3 layers: the first and third are convolutional layers and the second is a max-pooling layer. The architecture is shown in the figure below. Since the encoder contains a max-pooling layer, the decoder should contain an unpooling layer, but the authors did not specify which unpooling technique they used.

    File:convo arch.png
    Convolutional Neural Network <ref name="hadsell2009" />

    For training, the loss function is the mean squared error between the original input and the decoded image. The network is first trained on 10,000 unlabeled images covering 150 varied outdoor settings (unsupervised training), and is then fine-tuned on a labeled dataset (supervised training). The authors did not mention how large the labeled dataset was or what training parameters were used for the supervised stage.
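
Returning to the RBF extractor in the list above, its feature computation can be sketched in a few lines. This is a minimal illustration of the formula only: the function name is hypothetical, and the centers and inverse variances passed in would have to come from the off-line training step, which is not reproduced here.

```python
import numpy as np


def rbf_features(x, centers, betas):
    """Compute exp(-beta_i * ||x - K_i||^2) for every RBF centre K_i.

    x:       flattened input window, shape (d,)
    centers: RBF centres K, shape (n, d)
    betas:   inverse variances, shape (n,)
    """
    sq_dist = np.sum((centers - x) ** 2, axis=1)  # squared Euclidean distance
    return np.exp(-betas * sq_dist)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random centres stand in for learned ones (illustration only).
    centers = rng.normal(size=(100, 12))
    betas = np.full(100, 0.5)
    print(rbf_features(centers[0], centers, betas).shape)
```

Each feature lies in (0, 1] and equals 1 exactly when the input window coincides with the corresponding center, so the 100-component vector measures similarity to each learned prototype.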

Stereo Supervisor Module

File:ground plane estimation.png
Ground Plane Estimation <ref name="hadsell2009" />

Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:

  1. 3D point cloud: A 3D point cloud is produced using the Triclops stereo vision algorithm from Point Grey Research. The algorithm has a range of 12 to 15 meters and works by triangulating objects between two images to find their depth.

  2. Estimation of ground plane: A ground plane model is found by using a combination of a Hough transform and principal component analysis (PCA) to fit a plane to the point cloud [math]S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n \}[/math], where [math](x^{i}, y^{i}, z^{i})[/math] is the position of point [math]i[/math] relative to the robot’s center and [math]n[/math] is the number of points in the point cloud.

    The rationale behind using a Hough transform is that multiple ground planes can be found (see figure above). A voting system is therefore introduced, in which the parameter vector describing a ground plane (such as pitch, roll and offset) that receives the most votes is used. It is selected by the following equation:

    [math]X = P_{ijk}, \quad (i, j, k) = \arg\max_{i,j,k} (V_{ijk})[/math]

    where [math]X[/math] is the new plane estimate, [math]V[/math] is a tensor that accumulates the votes and [math]P[/math] is a tensor that records the plane parameter space. PCA is then used to refit the plane by computing the eigenvalue decomposition of the covariance matrix of the points [math]X^{1 \dots n}[/math]:

    [math]\frac{1}{n} \sum^{n}_{i=1} X^{i} X^{i'} = Q \Lambda Q'[/math]

    It should be noted, however, that using multiple ground planes does not eliminate all errors from the labeling process. The authors used the following heuristic to minimize errors in the training data:

    If the mean plane distance is not too high and the variance of the plane distance is very low, then the region is traversable (probably a traversable hillside). Conversely, if the mean plane distance is very low but the variance is higher, then that region is traversable (possibly tall grass). <ref name="hadsell2009" />

  3. Projection: Stereo vision can only robustly detect short-range (12 m max) objects. To mitigate the uncertainty about long-range objects, the footlines of obstacles (the bottom outline of each obstacle) are used. This gives stereo vision better estimates of the scale and distance of long-range objects. The footlines of long-range objects are found by projecting obstacle points onto the ground planes and marking regions of high point density.

  4. Labeling: Once the ground plane estimation, footline projections and obstacle points are found, ground map [math]G[/math], footline-map [math]F[/math] and obstacle-map [math]O[/math] can be produced.

    Conventionally, binary classifiers are used for terrain traversability; however, <ref name="hadsell2009" /> used a classifier with 5 labels:

    • Super-traversable

    • Ground

    • Footline

    • Obstacle

    • Super-obstacle

    File:label categories.png
    Label Categories <ref name="hadsell2009" />

    Super-traversable and super-obstacle are high-confidence labels assigned to input windows in which only ground or only obstacles are seen. The lower-confidence labels ground and obstacle are used when the input window contains a mixture of points. Lastly, footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in the figure below.

    File:label criteria.png
    Label Criteria Rules <ref name="hadsell2009" />
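
The ground plane estimation in step 2 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the vote tensor V and parameter tensor P are taken as given, the points are centered before the covariance is computed, and the plane normal is taken to be the eigenvector with the smallest eigenvalue. All function names are hypothetical. The last helper computes the mean and variance of the plane distance, the two quantities that the quoted heuristic relies on.

```python
import numpy as np


def best_voted_plane(P, V):
    """Return the parameter vector in P whose cell in V has the most votes.

    V has shape (I, J, K); P has shape (I, J, K, n_params).
    """
    i, j, k = np.unravel_index(np.argmax(V), V.shape)
    return P[i, j, k]


def refit_plane_pca(points):
    """Fit a plane to (n, 3) points; return its unit normal and offset.

    Assumption: points are centred first, and the normal is the
    eigenvector of the covariance matrix with the smallest eigenvalue.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]
    if normal[2] < 0:  # orient the normal upward for a consistent sign
        normal = -normal
    return normal, -normal @ centroid


def plane_distance_stats(points, normal, offset):
    """Mean and variance of the signed point-to-plane distances."""
    dist = points @ normal + offset
    return dist.mean(), dist.var()
```

For points that lie exactly on a plane, the smallest eigenvalue is zero and both statistics vanish; noisy or vegetated regions show up as a larger variance.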

Training and Classification

The real-time classifier is the last stage of the learning process. Because it runs in real time, the classifier has to be simple and efficient. The authors therefore used 5 logistic regression classifiers (one for each category), trained with stochastic gradient descent on a Kullback-Leibler divergence (relative entropy) loss function. Additionally, 5 ring buffers (circular buffers) store incoming data from the feature extraction and stereo supervisor stages. Each ring buffer acts as a First In First Out (FIFO) queue, holding data temporarily as it is received and processed. The result is that the classifiers output a 5-component likelihood vector for each input.
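
A minimal sketch of such an online classifier is shown below. It makes several assumptions that this summary does not confirm: the five outputs are normalized with a softmax, there is one ring buffer per label, and the learning rate and buffer size are arbitrary. With one-hot targets the Kullback-Leibler loss reduces to cross entropy, whose gradient is used in the update. The class and method names are hypothetical.

```python
from collections import deque

import numpy as np


class OnlineClassifier:
    """Five-way logistic regression trained by SGD on ring-buffered samples."""

    def __init__(self, n_features=100, n_classes=5, lr=0.1, buffer_size=1000):
        self.W = np.zeros((n_classes, n_features))
        self.lr = lr
        # Assumption: one FIFO ring buffer per label. A deque with maxlen
        # silently drops its oldest item when a new one arrives.
        self.buffers = [deque(maxlen=buffer_size) for _ in range(n_classes)]

    def predict(self, x):
        """Return a likelihood vector (softmax over the class scores)."""
        z = self.W @ x
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        return e / e.sum()

    def add_sample(self, x, label):
        """Store a feature vector under the label given by the supervisor."""
        self.buffers[label].append(np.asarray(x, dtype=float))

    def train_step(self):
        """One SGD pass over everything currently held in the buffers."""
        n_classes = self.W.shape[0]
        for label, buf in enumerate(self.buffers):
            target = np.eye(n_classes)[label]
            for x in buf:
                # Gradient of the cross-entropy loss for a softmax output.
                self.W -= self.lr * np.outer(self.predict(x) - target, x)
```

In a per-frame loop, `add_sample` would be called for every stereo-labeled window and `train_step` once per frame, after which `predict` can be applied to windows beyond stereo range.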

Experimental Results

Performances of Feature Extractors

File:feature extractors.png
Comparison of Feature Extractors <ref name="hadsell2009" />

To test the feature extractors, a dataset containing 160 hand-labeled frames from over 25 log files was used. The log files can be further divided into 7 groups, as seen in the figure above, which compares the 4 feature extractors: Radial Basis Functions, a Convolutional Neural Network, an unsupervised Auto-Encoder and a supervised Auto-Encoder. In almost all cases the best feature extractor was the CNN trained with Auto-Encoders, which achieved the best average error rate of [math]8.46\%[/math].

Performances of Stereo Supervisor Module

File:stereo module comparison.png
Stereo Module Performance <ref name="hadsell2009" />

To test the stereo module, it was compared against the online classifier using the same ground truth dataset as in the previous section. As the figure above shows, the online classifier performs better than the stereo supervisor module; the authors attribute this to the online classifier's ability to smooth and regularize the noisy data <ref name="hadsell2009" />.

Field Test

The online classifier was deployed on a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC) and tested on three different courses. The system runs 2 processes simultaneously: the 1-2 Hz online classifier outlined above and a fast 8-10 Hz stereo-based obstacle avoidance module. Together they provide good long-range and short-range obstacle avoidance capabilities.

The system was found to be most effective when the long-range online classifier was combined with the short-range module. Because the short-range module only has a range of around 5 meters, the vehicle often required human intervention to be rescued when that module was used alone. No quantitative comparisons were given for these field tests; the evaluation is purely subjective and was only carried out during daytime.


This paper did not introduce novel deep learning methods per se. However, its application of deep learning methods (CNN + auto-encoders) together with a stereo module to train a 5-label classifier was new in 2009, and it shows great promise in increasing the range of road classification from a maximum of 10-12 meters with purely stereo vision to over 100 meters <ref name="hadsell2009" />.

I observed several issues with the experiments:

  • There was no mention of how many times the feature extractors were trained to obtain the best parameters, nor of how difficult the training was.
  • All data collection and tests were performed during daytime, with no mention of limitations at night.
  • Other than stereo-vision-based systems, the paper did not compare itself against other state-of-the-art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" />.
  • The plot of stereo vision vs. the online classifier did not contain error bars. Also, the ground truth frames on the x-axis are ordered by error difference. It would be interesting to see the frames in time order instead, and whether that would show stereo vision performing well at the beginning but poorly afterwards, supporting the authors' claim that an online classifier is able to smooth and regularize the noisy data.
  • The field tests lacked quantitative measures for comparing the long-range system against the short-range system.


<references />