STAT946F17/ Learning a Probabilistic Latent Space of Object Shapes via 3D GAN



In this work, a novel method for 3D object generation is presented. This framework, namely 3D Generative Adversarial Networks (3D GAN), is an extension of GANs for 2D image generation. Here, a latent vector is sampled from a probabilistic space and then passed through a set of volumetric convolutional layers, resulting in a novel generated 3D object. The benefits of this approach are three-fold:

  1. the use of adversarial criterion, in place of traditional heuristic criteria, allows the generator to implicitly capture object structure leading to high quality and novel 3D objects
  2. the GAN learns a mapping from latent space to the space of generated objects automatically allowing it to bypass the need for reference CAD models when generating new 3D samples
  3. the adversarial discriminator can learn, in an unsupervised manner, a powerful 3D shape descriptor (i.e., feature vector), that is widely applicable and performs competitively in 3D object recognition.

From the experimental results, the authors show that features learned without supervision achieve excellent performance on 3D object recognition, comparable to that of supervised learning methods.

Related Work

Modeling and synthesizing 3D shapes: Various AI and vision researchers have contributed to the literature on 3D object understanding and on the generation or synthesis of 3D objects. To name a few: Carlson, 1982; Tangelder and Veltkamp, 2008; Van Kaick et al., 2011; Blanz and Vetter, 1999; Kalogerakis et al., 2012; Chaudhuri et al., 2011; Xue et al., 2012; Kar et al., 2015; Bansal et al., 2016; Wu et al., 2016.

  • Huang et al. [2015] explored generating 3D shapes with pre-trained templates and producing object structure and surface geometry.

Deep learning for 3D data:

  • Li et al. [2015], Su et al. [2015b], Girdhar et al. [2016] proposed learning a joint embedding of 3D shapes and synthesized images.
  • Su et al. [2015a] and Qi et al. [2016] learned discriminative representations for 3D object recognition.
  • Wu et al. [2016], Xiang et al. [2015], and Choy et al. [2016] studied 3D object reconstruction from in-the-wild images, possibly with a recurrent network.
  • Girdhar et al. [2016], Sharma et al. [2016] proposed autoencoder-based networks for learning voxel-based object representations.

Learning with an adversarial net:

  • Generative Adversarial Nets were proposed by Goodfellow et al., 2014, and incorporate an adversarial discriminator into the procedure of generative modeling.
  • Denton et al., 2015 and Radford et al., 2016 adopted GAN with convolutional networks by introducing LAPGAN and DC-GAN respectively.

In a nutshell, existing methods include

  • Borrow parts from objects in existing CAD model libraries → generate realistic but not novel samples
  • Learn deep object representations based on voxelized objects → fail to capture highly structured differences between 3D objects
  • Mostly learn based on a supervised criterion → limited to the objects in the dataset

According to Karpathy et al. in their OpenAI blog post, three generative modelling approaches have been widely adopted, namely Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Recurrent Neural Networks (RNNs) (Karpathy et al., 2016). Each of these methods has its own strengths, and the first two are the main focus of this paper. To give a more complete picture of generative modelling, the RNN approach, or more specifically the PixelRNN approach, is briefly explained below.

The fundamental concept of PixelRNN is to model each pixel within an image based on the previous pixels and their RGB color values (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 2). More specifically, PixelRNNs “model the pixels as discrete values using a multinomial distribution implemented with a simple softmax layer” (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 2). Mathematically, the goal of the model is to predict the probability of the next pixel [math]x_i[/math] given all previous pixels [math]x_1, \ldots, x_{i - 1}[/math]. Furthermore, the probability of each pixel [math]P(x_i)[/math] is factored over its 3 color channels (red, green, and blue, or RGB), with each channel conditioned on the channels before it. Thus, overall the probability of a pixel [math]x_i[/math] can be expressed by the following equation:

[math] P(x_i)= P(x_{i, R}| x_1, \ldots, x_{i - 1}) \, P(x_{i, G}| x_{i, R}, x_1, \ldots, x_{i - 1}) \, P(x_{i, B}| x_{i, R}, x_{i, G}, x_1, \ldots, x_{i - 1}) [/math]

In their paper, Oord, Kalchbrenner, & Kavukcuoglu estimated [math] P(x_i) [/math] “as a discrete distribution, with every conditional distribution being a multinomial that is modeled with a softmax layer” (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 3). However, a major limitation of this method is that sampling is computationally heavy, because each pixel must be drawn in sequence from its conditional distribution. Thus, more efficient approaches such as 3D-GANs and 3D-VAE-GANs are introduced in this paper.
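The channel-wise factorization can be illustrated numerically. The following is a minimal sketch, assuming each conditional distribution is already available as a softmax-normalised list over intensity values; the helper names `softmax` and `pixel_probability` are hypothetical and do not come from the PixelRNN paper.

```python
import math

def softmax(logits):
    """Turn raw scores into a discrete probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_probability(channel_dists, channel_values):
    """Multiply the conditional probability of each colour channel.

    channel_dists[k] is assumed to be the distribution of the k-th channel,
    already conditioned on all previous pixels and on the earlier channels
    of the current pixel.
    """
    prob = 1.0
    for dist, value in zip(channel_dists, channel_values):
        prob *= dist[value]
    return prob

# Toy example with 4 intensity levels per channel instead of 256.
uniform = softmax([0.0, 0.0, 0.0, 0.0])  # 0.25 for every level
p = pixel_probability([uniform, uniform, uniform], [1, 3, 0])
```

Sampling an image this way needs one such evaluation per channel per pixel, in order, which is the cost the text refers to.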

Oord et al. [10] presented a deep neural network that predicts the pixels of an image sequentially along its two spatial dimensions. Their method models the probability of the raw pixel values, thereby encoding the complete set of dependencies in the corresponding image.

Karpathy et al. [11] describe several projects that enhance or make use of generative models. In particular, they discuss how to improve GANs and VAEs. They also introduce InfoGAN, an extension of GAN that can learn disentangled and interpretable representations of images.


Let us first review Generative Adversarial Networks (GANs). As proposed in Goodfellow et al. [2014], GANs consist of a generator (here a deconvolutional neural network) and a discriminator (here a convolutional neural network), where the discriminator tries to distinguish real objects from objects synthesized by the generator, while the generator attempts to confuse the discriminator. The training data is the provided dataset, from which the discriminator learns what real samples look like. The generator receives randomized inputs sampled from a predefined latent space and synthesizes samples from them, which are then evaluated by the discriminator. In order for the generator to produce quality synthetic samples, backpropagation is applied to both the discriminator and the generator. With proper guidance, this adversarial game results in a generator that synthesizes fake samples so similar to the real training samples that the discriminator cannot distinguish them. This can be thought of as a zero-sum or minimax two-player game. The analogy typically used is that the generative model is like a team of counterfeiters, trying to produce and use fake currency, while the discriminative model is like the police, trying to detect the counterfeit currency. As the models train through alternating optimization, both improve until the counterfeits are indistinguishable from the genuine articles.


In this paper, 3D Generative Adversarial Networks (3D-GANs) are presented as a simple extension of GANs for 2D imagery. Here, the model is composed of i) a generator (G) which maps a 200-dimensional latent vector $z$ to a 64 x 64 x 64 cube, representing the object G($z$) in voxel space, and ii) a discriminator (D) which outputs a confidence value D($x$) of whether a 3D object input $x$ is real or synthetic. Following Goodfellow et al. [2014], the classification loss used at the end of the discriminator is binary cross-entropy as

$L_{3D-GAN} = \log D(x) + \log \big(1 - D(G(z)) \big)$

where $x$ is a real object in a 64 x 64 x 64 space, and $z$ is randomly sampled noise from a distribution $p(z)$. In this work, the coefficients of $z$ are sampled from a probabilistic latent space, Uniform[0,1], and each dimension of $z$ is an independent and identically distributed (IID) random variable. (A collection of random variables is IID if all of them have the same probability distribution and are mutually independent.)
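To make the objective concrete, here is a minimal sketch that evaluates it on a batch of discriminator outputs and draws a latent vector. The function names and the small constant `eps` are assumptions made for illustration; they are not taken from the authors' code.

```python
import math
import random

def gan_objective(d_real, d_fake):
    """Batch average of log D(x) + log(1 - D(G(z)))."""
    eps = 1e-12  # guards against log(0); an implementation detail, not from the paper
    terms = [math.log(dr + eps) + math.log(1.0 - df + eps)
             for dr, df in zip(d_real, d_fake)]
    return sum(terms) / len(terms)

def sample_latent(dim=200, rng=random):
    """Draw z whose coordinates are i.i.d. Uniform[0, 1)."""
    return [rng.random() for _ in range(dim)]

z = sample_latent()
# D(x) for real objects and D(G(z)) for synthesized ones (toy values).
confident = gan_objective([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
confused = gan_objective([0.5, 0.5], [0.5, 0.5])   # D is guessing
```

The objective is higher when the discriminator separates the two kinds of input, and it drops to 2 log 0.5 when the discriminator outputs 0.5 everywhere, which is the situation the generator is trying to bring about.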


3D-VAE-GAN, an extension of 3D-GAN, introduces an additional image encoder (E), which takes as input a 2D image $y$ and outputs the latent representation vector $z$. Inspired by the work of Larsen et al. [2015] on VAE-GANs, the addition of the E component allows a mapping between 2D images and their 3D shapes to be learned simultaneously with the adversarial training of the GAN, which learns to generate synthetic but realistic 3D objects. This means that after training, a 2D image can be fed into the 3D-VAE-GAN network, resulting in a realistic rendering of the corresponding 3D object for that 2D image. One would expect this network to perform better than a standalone VAE network that takes 2D images as input and outputs 3D shapes. Unfortunately, the authors do not provide any comparison between such setups.

The loss function for the 3D-VAE-GAN is similar to that of the VAE-GAN. These loss functions have the following form:

\begin{equation}L = L_{3D-GAN} + \alpha_{1}L_{KL} + \alpha_{2}L_{recon},\label{eq1}\end{equation}

where $\alpha_{1}$ and $\alpha_{2}$ are the weights of the KL divergence loss and the reconstruction loss, respectively. $L_{3D-GAN}$ is the cross-entropy loss, $L_{KL}$ is the KL divergence loss, and $L_{recon}$ is the reconstruction loss.
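As a rough illustration of how the three terms combine, the sketch below uses the standard closed form of the KL divergence between a diagonal Gaussian and $\mathcal{N}(0, I)$ and an L2 reconstruction distance. The function names and the weights in the example call are illustrative assumptions, not values from the paper.

```python
import math

def kl_to_standard_normal(mean, var):
    """Closed-form KL( N(mean, diag(var)) || N(0, I) )."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))

def l2_distance(a, b):
    """Euclidean distance between two flattened voxel grids."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def vae_gan_loss(l_gan, mean, var, recon, target, alpha1, alpha2):
    """L = L_3D-GAN + alpha1 * L_KL + alpha2 * L_recon."""
    l_kl = kl_to_standard_normal(mean, var)
    l_recon = l2_distance(recon, target)
    return l_gan + alpha1 * l_kl + alpha2 * l_recon

# Toy 2-dimensional latent and a 3-voxel "shape"; the weights are placeholders.
loss = vae_gan_loss(l_gan=-1.2,
                    mean=[0.0, 0.0], var=[1.0, 1.0],
                    recon=[1.0, 0.0, 1.0], target=[1.0, 0.0, 1.0],
                    alpha1=1.0, alpha2=1.0)
```

In this toy call the encoder already matches the prior and the reconstruction is perfect, so both extra terms vanish and only the GAN term remains.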

As depicted in the figure on the right [from Larsen et al., 2015], the setup of VAE-GAN shows that VAE and GAN are combined by sharing the decoder of VAE with the generator of GAN.

As outlined in the supplementary material of this paper, the training of 3D-VAE-GAN is done through the following steps. Here, $y_i$ is a 2D image and $x_i$ is its corresponding 3D shape. In each training iteration $t$, a random sample $z_t$ is generated from $\mathcal{N}(0, I)$, and the discriminator (D), image encoder (E), and generator (G) are updated in turn.

[Figure: VAE-GAN setup, from Larsen et al., 2015]
  • Step 1: Update the discriminator D by minimizing the following loss function:

$$\log D(x_i) + \log \big(1 - D(G(z_t)) \big)$$

(Although the problem is not exactly the same as that studied by Larsen et al. [2015], better results may be observed when using samples from $q(z|y)$ (i.e. the encoder) in addition to the prior $p(z)$ in the GAN objective: \[ \log D(x_i) + \log \big( 1- D(G(z_t)) \big) + \log \big( 1-D(G(E(y_i))) \big). \] The reason is that the negative sample $G(E(y_i))$ is much more likely to be similar to the real shape $x_i$ than $G(z_t)$ is. When updating according to $L_{GAN}$, Larsen et al. [2015] suspected that having similar positive and negative samples makes for a more useful learning signal.)

  • Step 2: Update the image encoder E by minimizing the following loss function:

$$\alpha_1 D_{KL} \big(\mathcal{N}(E_{mean}(y_i), E_{var}(y_i)) \big|\big| \mathcal{N}(0, I) \big) + \alpha_2\big|\big|G(E(y_i)) - x_i\big|\big|_2$$

where $E_{mean}(y_i)$ and $E_{var}(y_i)$ are the predicted mean and variance of the latent variable $z$, respectively.

  • Step 3: Update the generator G by minimizing the following loss function:

$$\log \big(1 - D(G(z_t)) \big) + \alpha_2\big|\big|G(E(y_i)) - x_i\big|\big|_2$$

(In Step 2 and Step 3, the formulae in the original supplementary material do not include $\alpha_1$ and $\alpha_2$. I believe this is a typo, given the overall loss function.)

Training Details

Network Architecture


The generator used in 3D-GAN follows the architecture of Radford et al.'s [2016] all-convolutional network, a neural network with no fully-connected and no pooling layers. As portrayed in Figure 1, this network comprises 5 volumetric fully convolutional layers with kernels of size 4 x 4 x 4 and stride 2. Batch normalization and ReLU layers $(f(x) = \mathbb{1}(x \ge 0)(x))$ follow every layer except the last, which is followed by a Sigmoid layer. The input is a 200-dimensional vector and the output is a 64 x 64 x 64 matrix with values in [0,1].
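One way to see how five layers with kernel size 4 and stride 2 reach a 64 x 64 x 64 output is to track the edge length of the volume through the standard transposed-convolution size formula. This is a sketch under assumed padding values (0 for the first layer, 1 for the rest), which the text does not specify; the function names are hypothetical.

```python
def transposed_conv_out(size, kernel=4, stride=2, padding=1):
    """Output edge length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel

def generator_edge_lengths(paddings=(0, 1, 1, 1, 1)):
    """Edge length after each of the 5 layers, starting from a 1 x 1 x 1 volume.

    The 200-dimensional latent vector is viewed as a 1 x 1 x 1 volume with
    200 channels. The padding values are assumptions chosen so that the
    final output is 64 x 64 x 64.
    """
    size = 1
    sizes = []
    for padding in paddings:
        size = transposed_conv_out(size, padding=padding)
        sizes.append(size)
    return sizes

edges = generator_edge_lengths()
```

Under these assumptions the edge length doubles at every layer after the first, ending at 64.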

[Figure 1: network architecture of the 3D-GAN generator]

The discriminator mostly mirrors the generator. In particular, the discriminator network takes its input either from the generator or from real data and tries to predict whether that input is generated or real. This network takes a 64 x 64 x 64 matrix as input and outputs a real number in [0,1]. Instead of the ReLU activation function, the discriminator has leaky ReLU layers $(f(x) = \mathbb{1}(x \lt 0)(\alpha x) + \mathbb{1}(x \ge 0)(x))$ with $\alpha$ = 0.2. Batch normalization layers and the final Sigmoid layer are the same in both the generator and discriminator networks.
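The activation functions named above are simple enough to write out directly. A minimal scalar sketch (the function names are chosen here for illustration):

```python
import math

def relu(x):
    """ReLU: passes non-negative inputs through, zeroes out the rest."""
    return x if x >= 0 else 0.0

def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: negative inputs are scaled by alpha instead of zeroed."""
    return x if x >= 0 else alpha * x

def sigmoid(x):
    """Sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```

The only difference between the two rectifiers is on negative inputs, where the leaky variant keeps a small gradient flowing through the discriminator.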

Image Encoder

Finally, the image encoder in the VAE network takes as input an RGB image of size 256 x 256 x 3 and outputs a 200-dimensional vector. This network again consists of 5 spatial (not volumetric) convolutional layers with numbers of channels {64, 128, 256, 512, 400}, kernel sizes {11, 5, 5, 5, 8}, and strides {4, 2, 2, 2, 1}, respectively. ReLU and batch normalization layers are interspersed between every convolutional layer. While the output of this image encoder is 200-dimensional, the final layer outputs a 400-dimensional vector that represents a 200-dimensional Gaussian (split evenly to represent the mean and diagonal covariance). This is a common component of variational auto-encoder networks. Therefore, a final sampling layer is appended to the last convolutional layer to sample a 200-dimensional vector from the Gaussian distribution, which is later used by the 3D-GAN.
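The sketch below checks that the listed kernel sizes and strides reduce a 256 x 256 image to a single spatial position, and shows one way the final sampling layer could work. The padding values are assumptions (the text does not give them), and treating the second half of the 400 outputs as a non-negative variance is also an assumption; all names are hypothetical.

```python
import math
import random

KERNELS = (11, 5, 5, 5, 8)
STRIDES = (4, 2, 2, 2, 1)
PADDINGS = (4, 2, 2, 2, 0)  # assumed; chosen so the last layer sees an 8 x 8 map

def conv_out(size, kernel, stride, padding):
    """Output edge length of an ordinary convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_edge_lengths(size=256):
    """Spatial edge length after each of the 5 convolutional layers."""
    sizes = []
    for kernel, stride, padding in zip(KERNELS, STRIDES, PADDINGS):
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

def sample_latent(encoder_output, rng=random):
    """Split the output into mean and variance halves and draw z.

    Uses z = mean + sqrt(var) * eps with eps ~ N(0, 1), assuming the second
    half already holds valid (non-negative) variances.
    """
    half = len(encoder_output) // 2
    mean, var = encoder_output[:half], encoder_output[half:]
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]

edges = encoder_edge_lengths()
z = sample_latent([0.5] * 200 + [0.0] * 200)  # zero variance: z equals the mean
```

With a 1 x 1 spatial output and 400 channels, the last layer yields exactly the 400 numbers that are split into the mean and the diagonal covariance.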

Coupled Generator-Discriminator Training

Training GANs is tricky because in practice training a network to generate objects is more difficult than training a network to distinguish between real and fake samples. In other words, training the generator is harder than training the discriminator. Intuitively, it becomes difficult for the generator to extract a signal for improvement from a discriminator that is way ahead, as all the examples it generates would be correctly identified as synthetic with high confidence. This problem is compounded when we deal with 3D generated objects (compared to 2D) due to the higher dimensionality. There exist different strategies to overcome this challenge, some of which we saw in class:

  • 1 discriminator update every N generator updates
  • Capped gradient updates, where only a maximum gradient is propagated back through the network for the discriminator network, essentially capping how fast it can learn

The approach used in this paper is interesting in that it adaptively decides whether to train the discriminator or not. Here, for each batch, D is only updated if its accuracy in the last batch is <= 80%. Additionally, the generator learning rate is set to 2.5 x 10^-3 whereas the discriminator learning rate is set to 10^-5. This further caps the speed of training for the discriminator relative to the generator. In fact, many such techniques are necessary when training GANs, because the optimization problem they are designed to solve is inherently different from the intended goal of finding a Nash equilibrium in a non-convex game [Salimans et al., 2016]. Some recently proposed techniques include feature matching, minibatch discrimination, historical averaging, one-sided label smoothing, and virtual batch normalization [Salimans et al., 2016].
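The adaptive rule can be sketched as a small gating function. This is an illustration only: the accuracies are pre-recorded toy values rather than the output of a real training loop, updating D on the very first batch is an assumption, and the function names are hypothetical.

```python
def discriminator_accuracy(d_real, d_fake):
    """Fraction of samples D classifies correctly at a 0.5 decision threshold."""
    correct = sum(1 for p in d_real if p >= 0.5) + sum(1 for p in d_fake if p < 0.5)
    return correct / (len(d_real) + len(d_fake))

def plan_discriminator_updates(batch_accuracies, threshold=0.8):
    """For each batch, decide whether D is updated, based on the previous batch."""
    updates = []
    last_accuracy = 0.0  # no history yet, so D is trained on the first batch (assumption)
    for accuracy in batch_accuracies:
        updates.append(last_accuracy <= threshold)
        last_accuracy = accuracy
    return updates

acc = discriminator_accuracy(d_real=[0.9, 0.7, 0.4], d_fake=[0.2, 0.6, 0.1])
plan = plan_discriminator_updates([0.6, 0.9, 0.85, 0.7])
```

In the toy plan, D skips the two batches that follow an accuracy above 80%, which gives the generator time to catch up.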


To assess the quality of 3D-GAN and 3D-VAE-GAN, the authors performed the following set of experiments

  1. Qualitative results for 3D generated objects
  2. Classification performance of learned representations without supervision
  3. 3D object reconstruction from a single image
  4. Analyzing learned representations for generator and discriminator

Each of these experiments has a dedicated section below with experiment setup and results. First, we shall introduce the datasets used across these experiments.


  • ModelNet 10 & ModelNet 40 [Wu et al., 2015]
    • A comprehensive and clean collection of 3D CAD models for objects, used as a popular benchmark for 3D classification
    • Built from a list of the most common object categories in the world
    • 3D CAD models for each object category were collected by querying online search engines
    • Manually annotated by hired human workers on Amazon Mechanical Turk, who decided whether each CAD model belongs to the specified categories
    • ModelNet 10 & ModelNet 40 datasets completely cleaned in-house
    • Orientations of CAD models in ModelNet 10 are also manually aligned
  • ShapeNet [Chang et al., 2015]
    • Clean 3D models and manually verified category and alignment annotations
    • 55 common object categories with about 51,300 unique 3D models
    • Collaborative effort between researchers at Princeton, Stanford and Toyota Technological Institute at Chicago (TTIC)
  • IKEA Dataset [Lim et al., 2013]
    • 1039 objects centre-cropped from 759 images
    • Images captured in the wild, often with cluttered backgrounds and occluded
    • 6 categories: bed, bookcase, chair, desk, sofa, table

Experiment 1: Qualitative results for 3D generated objects

Figure 2 shows 3D objects generated by the 3D-GAN framework. To generate these objects, a 200-dimensional vector sampled from a uniform distribution over [0,1] is passed as input to the generator, and the largest connected component in the output of the generator is taken as the generated 3D object. One 3D-GAN is trained for each object class.
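Keeping only the largest connected component can be sketched with a breadth-first search over occupied voxels. The 0.5 occupancy threshold and the use of 6-connectivity (face neighbours only) are assumptions made for this example; the text does not state which were used, and the function names are hypothetical.

```python
from collections import deque

# Face neighbours only (6-connectivity); an assumption for this sketch.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def threshold_voxels(grid, level=0.5):
    """Convert a nested-list grid of values in [0, 1] into a set of occupied coordinates."""
    return {(x, y, z)
            for x, plane in enumerate(grid)
            for y, row in enumerate(plane)
            for z, value in enumerate(row)
            if value >= level}

def largest_connected_component(occupied):
    """Return the largest connected subset of a set of (x, y, z) voxels."""
    remaining = set(occupied)
    best = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best

# A bar of three voxels plus one stray voxel far away.
voxels = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 5, 5)}
shape = largest_connected_component(voxels)
```

The stray voxel is discarded, which is the kind of floating noise this post-processing step removes from generated shapes.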

Unfortunately, measures of comparison for samples generated by generative models are qualitative and subjective. Here, the authors compare samples generated by 3D-GAN with 3D objects synthesized from a probabilistic space [Wu et al., 2015], and with those generated by volumetric auto-encoders [Girdhar et al., 2016]. It is important to consider how objects are generated using the volumetric auto-encoder, given that auto-encoders do not restrict the latent space. To overcome this challenge, and to generate novel samples (rather than simply copying latent variables of samples in the training set), a Gaussian is fit to the empirical distribution of the latent vectors of the training data. Samples drawn from this Gaussian act as the latent representation for a sample that is generated using the decoder of the volumetric auto-encoder.
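The baseline sampling procedure for the auto-encoder can be sketched as follows. A diagonal covariance is assumed here for simplicity; the text does not say whether a full covariance was fitted, and the function names are hypothetical.

```python
import random
import statistics

def fit_diagonal_gaussian(latents):
    """Per-dimension mean and standard deviation of a list of latent vectors."""
    dims = list(zip(*latents))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def sample_latent(means, stds, rng=random):
    """Draw a new latent vector from the fitted Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]

# Two toy 2-dimensional latent codes standing in for encoded training shapes.
means, stds = fit_diagonal_gaussian([[0.0, 2.0], [2.0, 2.0]])
z_new = sample_latent(means, stds)
```

The new vector would then be passed through the decoder to produce a shape that is not tied to any single training example.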

Results in Figure 2 demonstrate that 3D-GANs are able to synthesize high-resolution 3D objects with detailed geometries, and subjective comparisons strongly favor the 3D objects generated by 3D-GAN. In Figure 2, the nearest neighbours of each generated object are also depicted in the 2 right-most columns. From this, we see that objects generated via the 3D-GAN framework are novel and do not simply copy components from samples in the training set.

Experiment 2: Classification performance of learned representations without supervision

Another experiment conducted by the authors was to evaluate the representations of 3D objects learned by the discriminator. A typical way to evaluate representations learned without supervision is to use them as features for classification. Therefore, for each object, the authors concatenate the responses of the second, third, and fourth convolutional layers in the discriminator, resulting in one vector representation for a given 3D object (training sample or 3D generated sample). A linear SVM was then used to perform classification using these object representations. Here, a single 3D-GAN was trained on seven major ShapeNet classes (chairs, sofas, tables, boats, airplanes, rifles, and cars), but was evaluated using the objects in both ModelNet 10 and ModelNet 40. These results are even more insightful given that the training and test categories are not identical, and they therefore show the out-of-category generalization power of the 3D-GAN.
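The feature construction can be sketched as flattening and concatenating the selected layer responses. The layer responses below are tiny made-up lists standing in for real activation volumes, and the helper names are hypothetical; the resulting vector is what would be handed to a linear SVM.

```python
def flatten(volume):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    if isinstance(volume, (int, float)):
        return [float(volume)]
    flat = []
    for item in volume:
        flat.extend(flatten(item))
    return flat

def shape_descriptor(layer_responses, layers=(2, 3, 4)):
    """Concatenate the flattened responses of the chosen layers (1-indexed)."""
    features = []
    for index in layers:
        features.extend(flatten(layer_responses[index]))
    return features

# Toy responses of a 5-layer discriminator for one input object.
responses = {1: [[1, 2]], 2: [[0.5, 1.5]], 3: [[2.0]], 4: [[3.0, 4.0]], 5: [[9.0]]}
descriptor = shape_descriptor(responses)
```

Only layers 2 to 4 contribute, so the first and last layers of the toy network are ignored.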

Table 1 demonstrates the superior performance of 3D-GAN compared to competing unsupervised methods, and shows performance on par with many supervised strategies. Only Multi-view CNNs, a method designed for classification (not generation of 3D objects) and augmented with ImageNet pretraining, is able to outperform 3D-GAN on discriminator representation classification.

Experiment 3: 3D object reconstruction from a single image

Following previous work [Girdhar et al., 2016], the performance of 3D-VAE-GAN was evaluated on the IKEA dataset to demonstrate how it performs for single image 3D reconstruction. The results in Figure 7 and Table 2 show the performance of both a single 3D-VAE-GAN jointly trained on all 6 IKEA object categories, and six 3D-VAE-GANs independently trained on each category. To evaluate the performance of the models across different image setups, a 3D object was generated for permutations, flips, and translational alignments (up to 10%) of an input image. Then the average of the generated 3D objects was compared to the 3D ground truth for the 2D image.

The results in this section show that 3D-VAE-GAN consistently outperforms the previous state-of-the-art methods for voxel-level predictions.

Experiment 4: Analyzing learned representations for generator and discriminator

[Figures: learned representations of the generator (three panels) and of the discriminator (one panel)]

In this section we explore the learned representations of the generator and discriminator in a trained 3D-GAN. Starting with a 200-dimensional vector as input, the generator neurons fire to generate a 3D object, which in turn leads to the firing of neurons in the discriminator, producing a confidence value in [0,1]. To understand the latent space of vectors for object generation, we first vary the intensity of each dimension in the latent vector and observe the effect on the generated 3D objects. In Figure 5, each red region marks the voxels affected by changing values in a particular dimension of the latent vector. It can be seen that semantic meaning, such as the width and thickness of surfaces, is encoded in each of these dimensions.

Next, we explore intra-class and inter-class object metamorphosis by interpolating between latent vector representation of a source and target 3D object. In Figure 6, we see a smooth transition exists for various types of chairs (with and without arm rests, and with varying backrest), as well as for a smooth transition between race car and speedboat.
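The interpolation itself can be sketched in a few lines. Linear interpolation between the two latent vectors is assumed here, since the text does not say which scheme was used; the function name is hypothetical, and each intermediate vector would be fed to the generator to produce one frame of the transition.

```python
def interpolate(z_source, z_target, steps):
    """Evenly spaced latent vectors from z_source to z_target, endpoints included.

    Requires steps >= 2.
    """
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_source, z_target)])
    return path

# Toy 2-dimensional latents; the real ones are 200-dimensional.
path = interpolate([0.0, 1.0], [1.0, 0.0], steps=5)
```

The first and last entries reproduce the source and target exactly, and the middle entry is their average.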

Next, as is common in generative model evaluations, a simple arithmetic scheme is tested on the latent vector representations of 3D objects. Figure 8 shows that not only are generative networks able to encode semantic knowledge of chair and face images in their latent spaces, but these learned representations also behave similarly: simple arithmetic on the latent vector representations works in accord with intuition.

Finally, the authors explore the neurons in the discriminator. In order to understand firing patterns for specific neurons, the authors iterate through all training objects while keeping track of those samples that result in the highest firing intensity of a specific neuron. Here the neurons in the second-to-last convolutional layers were considered. From Figure 9, we conclude that neurons are selective: for a single neuron, the objects producing strongest activations are similar, and neurons learn semantics: the object parts that activate the neuron the most are consistent across objects.


The authors provide some supplementary resources for their proposed novel methodology. We describe these supplementary resources in this section.

Pre-trained models and sampling code for 3D-GAN can be found in the following git repositories:

Torch 7


Summary of Contributions

In this work, we have presented a novel approach to 3D object generation. We described 3D-GANs, showed their architecture, discussed loss functions, dove into the intricacies of the training process, and demonstrated their ability to generate realistic and novel high-resolution 3D objects. Furthermore, we reviewed the performance of 3D-GANs in producing feature vectors for object recognition and showed how the features learned by the discriminator outperform all unsupervised methods and are competitive with many supervised strategies. We extended 3D-GANs to 3D-VAE-GANs and learned a mapping from 2D images to the 3D objects corresponding to them. Using 3D-VAE-GANs, the authors were able to reconstruct 3D objects from a single image with greater accuracy than previous methods. Finally, the neurons in the learned networks were analyzed, and it was shown that the neurons learn disentangled features and fire selectively for different objects while learning the semantics of the objects they fire for.


  1. Girdhar, Rohit, et al. Learning a predictable and generative vector representation for objects. European Conference on Computer Vision. Springer International Publishing, 2016.
  2. Wu, Jiajun, et al. Single image 3d interpreter network. European Conference on Computer Vision. Springer International Publishing, 2016.
  3. Wu, Zhirong, et al. 3d shapenets: A deep representation for volumetric shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  4. Chang, Angel X., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  5. Larsen, Anders Boesen Lindbo, et al. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  6. Lim, Joseph J., Hamed Pirsiavash, and Antonio Torralba. Parsing ikea objects: Fine pose estimation. Proceedings of the IEEE International Conference on Computer Vision, 2013.
  7. Good explanation of Coupled GAN:
  8. 2 Minute Video Summary:
  9. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems (pp. 2234-2242).
  10. Oord, A. van den, Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel Recurrent Neural Networks. arXiv:1601.06759 [Cs].
  11. Karpathy, A., Abbeel, P., Brockman, G., Chen, P., Cheung, V., Duan, R., … Zaremba, W. (2016, June 16). Generative Models. Retrieved October 20, 2017.