MarrNet: 3D Shape Reconstruction via 2.5D Sketches

From statwiki
Revision as of 04:37, 22 March 2018 by J3ku (talk | contribs) (Training)
Jump to: navigation, search


Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.

In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images. The two step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the author’s design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape.

The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.

Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.

The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.

Related Work

2.5D Sketch Recovery

Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.

Single Image 3D Reconstruction

The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.

2D-3D Consistency

Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.


The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.


2.5D Sketch Estimation

The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.

3D Shape Estimation

The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.

Re-projection Consistency

The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, [math]v_{x, y, z}[/math] represents the value at position [math](x, y, z)[/math] in a 3D voxel grid, with [math]v_{x, y, z} \in [0, 1] ∀ x, y, z[/math]. [math]d_{x, y}[/math] denotes the estimated depth at position [math](x, y)[/math], [math]n_{x, y} = (n_a, n_b, n_c)[/math] denotes the estimated surface normal. Orthographic projection is used.


The voxel with depth [math]v_{x, y}, d_{x, y}[/math] should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:

[math] L_{depth}(x, y, z)= \left\{ \begin{array}{ll} v^2_{x, y, z}, & z \lt d_{x, y} \\ (1 - v_{x, y, z})^2, & z = d_{x, y} \\ 0, & z \gt d_{x, y} \\ \end{array} \right. [/math]

[math] \frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} = \left\{ \begin{array}{ll} 2v{x, y, z}, & z \lt d_{x, y} \\ 2(v_{x, y, z} - 1), & z = d_{x, y} \\ 0, & z \gt d_{x, y} \\ \end{array} \right. [/math]

When [math]d_{x, y} = \infty[/math], all voxels in front of it should be 0.

Surface Normals

Since vectors [math]n_{x} = (0, −n_{c}, n_{b})[/math] and [math]n_{y} = (−n_{c}, 0, n_{a})[/math] are orthogonal to the normal vector [math]n_{x, y} = (n_{a}, n_{b}, n_{c})[/math], they can be normalized to obtain [math]n’_{x} = (0, −1, n_{b}/n_{c})[/math] and [math]n’_{y} = (−1, 0, n_{a}/n_{c})[/math] on the estimated surface plane at [math](x, y, z)[/math]. The projected surface normal tried to guarantee voxels at [math](x, y, z) ± n’_{x}[/math] and [math](x, y, z) ± n’_{y}[/math] should be 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.

The projected surface normal loss is defined as follows, with [math]z = d_{x, y}[/math]:

[math] L_{normal}(x, y, z) = (1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + (1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2 [/math]

Gradients along x are:

[math] \frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1) [/math] and [math] \frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1) [/math]

Gradients along y are similar to x.


The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.

For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.

Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but lead to unrealistic 3D appearance due to overfitting.

Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.













Other Data