STAT946F17/ Learning a Probabilistic Latent Space of Object Shapes via 3D GAN



Revision as of 15:04, 17 October 2017

Introduction

Related Work

Existing methods

  • Borrow parts from objects in existing CAD model libraries → realistic but not novel
  • Learn deep object representations based on voxelized objects → fail to capture highly structured differences between 3D objects
  • Mostly learn based on a supervised criterion

Methodology

Let us first review GANs...

3D-GANs

3D-GANs are a simple extension of GANs for 2D imagery. Here, the model is composed of

  • Generator (G): maps a 200-dimensional latent vector z, randomly sampled from a probabilistic latent space (U[0,1]), to a 64 x 64 x 64 cube representing the object G(z) in voxel space.
  • Discriminator (D): outputs a confidence value D(x) of whether a 3D object input x is real or synthetic.

and a loss function L_{3D-GAN} = log D(x) + log(1 - D(G(z)))
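
To make the loss above concrete, the following minimal sketch evaluates it for a single real/generated pair. The function name gan_loss and the small eps guard against log(0) are assumptions added for illustration; they are not part of the paper.

```python
import math


def gan_loss(d_real, d_fake, eps=1e-12):
    """Per-sample 3D-GAN objective: log D(x) + log(1 - D(G(z))).

    d_real: discriminator output D(x) on a real object, in (0, 1)
    d_fake: discriminator output D(G(z)) on a generated object, in (0, 1)
    eps:    small constant (an assumption) guarding against log(0)
    """
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


# A confident, correct discriminator pushes the value toward 0, its maximum.
strong_d = gan_loss(d_real=0.99, d_fake=0.01)

# A discriminator that cannot tell real from synthetic gives a lower value.
confused_d = gan_loss(d_real=0.5, d_fake=0.5)
```

The discriminator tries to increase this quantity, while the generator tries to decrease the second term by making D(G(z)) large.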

3D-VAE-GANs

An extension of ...

Training and Results

Network Architecture

Generator

Discriminator
Encoder

Coupled Generator-Discriminator Training

Training GANs is tricky because, in practice, training a network to generate objects is more difficult than training a network to distinguish between real and fake samples. In other words, training the generator is harder than training the discriminator. Intuitively, it becomes difficult for the generator to extract a signal for improvement from a discriminator that is far ahead, as all the examples it generates would be correctly identified as synthetic with high confidence. This problem is compounded for 3D generated objects (compared to 2D) due to the higher dimensionality. There exist different strategies to overcome this challenge, some of which we saw in class:

  • One D update for every N G updates
  • Capped gradient updates, where only a maximum gradient is propagated back through the network for the discriminator network, essentially capping how fast it can learn

The approach used in this paper is interesting in that it adaptively decides whether to update the discriminator. For each batch, D is only updated if its accuracy in the last batch is <= 80%. Additionally, the generator learning rate is set to 2.5 x 10^-3 (0.0025), whereas the discriminator learning rate is set to 10^-5. This further caps the speed of training for the discriminator relative to the generator.
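
The adaptive rule can be sketched as a simple gate on the discriminator step. This is only an illustration under stated assumptions: the names should_update_discriminator and run_schedule are hypothetical, the learning rates are read as 2.5 x 10^-3 and 10^-5, and the first batch is assumed to always update D because no previous accuracy exists yet.

```python
ACCURACY_THRESHOLD = 0.8   # from the text: update D only if last accuracy <= 80%
LR_GENERATOR = 2.5e-3      # generator learning rate (assumed reading of the text)
LR_DISCRIMINATOR = 1e-5    # discriminator learning rate (assumed reading)


def should_update_discriminator(last_batch_accuracy, threshold=ACCURACY_THRESHOLD):
    """Return True when D is allowed to take a gradient step on this batch."""
    return last_batch_accuracy <= threshold


def run_schedule(accuracies):
    """Given D's accuracy on each batch, return which batches update D.

    Batch i is gated by the accuracy measured on batch i-1. The first batch
    always updates (an assumption, since there is no earlier measurement).
    """
    decisions = []
    last_accuracy = 0.0
    for accuracy in accuracies:
        decisions.append(should_update_discriminator(last_accuracy))
        last_accuracy = accuracy
    return decisions


# Hypothetical per-batch accuracies, used only to exercise the gate.
decisions = run_schedule([0.55, 0.70, 0.85, 0.95, 0.78, 0.80])
```

In a real training loop the generator step would run on every batch, and only the discriminator step would sit behind this check.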

Evaluation

Qualitative results for 3D generated objects

For generation:

  • Sample a 200-dimensional vector whose entries are i.i.d. uniform over [0,1]
  • Render the largest connected component
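
The two generation steps can be sketched as follows. The generator itself is omitted; the sketch assumes its output has already been thresholded into a set of occupied voxel coordinates, and it assumes 6-connectivity (face-adjacent voxels) when finding components. Both choices, and the function names, are illustrative assumptions.

```python
import random
from collections import deque

LATENT_DIM = 200


def sample_latent(rng, dim=LATENT_DIM):
    """Draw a latent vector z with i.i.d. Uniform[0, 1) entries."""
    return [rng.random() for _ in range(dim)]


def largest_connected_component(voxels):
    """Return the largest 6-connected set of occupied voxels.

    voxels: a set of (x, y, z) integer coordinates marking occupied cells.
    """
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    unvisited = set(voxels)
    best = set()
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        # Breadth-first search over face-adjacent occupied voxels.
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in neighbours:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


z = sample_latent(random.Random(0))

# Toy occupancy: an L-shaped part plus one stray voxel that gets discarded.
toy = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)}
main_part = largest_connected_component(toy)
```

Keeping only the largest component removes small floating fragments before rendering.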

Compared with:

  • 3D object synthesis from a probabilistic space [Wu et al., 2015]
  • Volumetric auto-encoders (because the latent space is not restricted, we fit a Gaussian to the empirical latent space distribution)

Observations:

  • Able to synthesize high-resolution 3D objects with detailed geometries
  • Objects are similar, but not identical, to training samples → not memorizing


Classification performance of learned representations without supervision

A typical way to evaluate representations learned without supervision is to use them as features for classification.

  • Input: 3D object
  • Output: feature vector (the concatenated responses of the 2nd, 3rd, and 4th convolutional layers in the discriminator, with max-pooling of size {8,4,2} applied)
  • Classifier: linear SVM
  • Training data: ShapeNet
  • Test data: ModelNet {10, 40}
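
As a rough check on the size of this feature vector, the sketch below counts the pooled responses. The layer shapes are assumptions not stated on this page: a 64 x 64 x 64 input halved by each stride-2 convolution, with 128, 256, and 512 channels in the 2nd, 3rd, and 4th layers, and non-overlapping pooling windows.

```python
def pooled_size(spatial, pool):
    """Edge length after non-overlapping max pooling with window `pool`."""
    return spatial // pool


# (channels, spatial edge length) for the 2nd, 3rd, and 4th conv layers.
# These shapes are assumptions for a 64^3 input with stride-2 convolutions.
LAYERS = [(128, 16), (256, 8), (512, 4)]
POOL_SIZES = [8, 4, 2]  # from the text: max-pooling of size {8, 4, 2}


def feature_dimension(layers=LAYERS, pools=POOL_SIZES):
    """Total length of the concatenated, max-pooled feature vector."""
    total = 0
    for (channels, spatial), pool in zip(layers, pools):
        total += channels * pooled_size(spatial, pool) ** 3
    return total


dim = feature_dimension()
```

Under these assumptions every layer is pooled down to a 2 x 2 x 2 grid, so the vector length is 8 times the total channel count.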


3D object reconstruction from a single image

Following previous work [Girdhar et al., 2016], the performance of 3D-VAE-GAN was evaluated on the IKEA dataset:

  • 1039 objects centre-cropped from 759 images (supplied by author)
  • Images captured in the wild, often with cluttered backgrounds and occlusions
  • 6 categories: bed, bookcase, chair, desk, sofa, table

Performance was measured for two settings:

  • A single 3D-VAE-GAN trained on all 6 categories
  • Multiple 3D-VAE-GANs, each trained on 1 category

Each prediction is aligned with the ground truth over permutations, flips, and 10% translation.
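
The alignment step can be sketched as a brute-force search over axis permutations, axis flips, and small integer shifts. This is an illustration under assumptions: voxel grids are represented as sets of coordinates, the translation range is read as up to 10% of the grid edge length, and intersection-over-union is used as a stand-in score rather than the metric reported in the paper. All function names are hypothetical.

```python
from itertools import permutations, product


def iou(a, b):
    """Intersection-over-union of two voxel sets (stand-in score)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def transforms(voxels, size, max_shift):
    """Yield every axis permutation, flip, and integer shift of `voxels`.

    voxels: set of (x, y, z) coordinates inside a size^3 grid.
    Shifted voxels may leave the grid; they then count against the score.
    """
    shifts = range(-max_shift, max_shift + 1)
    for perm in permutations(range(3)):
        for flips in product([False, True], repeat=3):
            moved = set()
            for v in voxels:
                p = [v[perm[i]] for i in range(3)]
                p = [size - 1 - c if f else c for c, f in zip(p, flips)]
                moved.add(tuple(p))
            for dx, dy, dz in product(shifts, repeat=3):
                yield {(x + dx, y + dy, z + dz) for x, y, z in moved}


def best_aligned_score(prediction, ground_truth, size, shift_fraction=0.1):
    """Best score over all candidate alignments of the prediction."""
    max_shift = int(size * shift_fraction)
    return max(iou(t, ground_truth)
               for t in transforms(prediction, size, max_shift))


# Toy example on a 10^3 grid: the same bar, rotated and offset by one voxel.
ground_truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
prediction = {(0, 1, 0), (0, 2, 0), (0, 3, 0)}
raw = iou(prediction, ground_truth)
aligned = best_aligned_score(prediction, ground_truth, size=10)
```

In the toy example the unaligned shapes do not overlap at all, while swapping two axes and shifting by one voxel makes them coincide.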

Future Work and Open questions
