Deep Single-View 3D Object Reconstruction with Visual Hull Embedding



SLIDE 1

Deep Single-View 3D Object Reconstruction with Visual Hull Embedding

Hanqing Wang¹,², Jiaolong Yang², Wei Liang¹, Xin Tong²

¹Beijing Institute of Technology, Beijing, China
²Microsoft Research Asia, Beijing, China

AAAI 2019

SLIDE 2
  • Input: a single RGB(D) Image
  • Output: the corresponding 3D representation

Single-View 3D Reconstruction

SLIDE 3

Previous Works

  • Deep Learning based Methods:

[Girdhar ECCV’16] [Choy ECCV’16]

Other works: [Yan NIPS’16][Wu NIPS’16][Tulsiani CVPR’17][Zhu ICCV’17]

SLIDE 4
  • Problems of existing deep-learning-based methods:
  • 1. Arbitrary-view input images vs. canonical-view-aligned 3D shapes
  • 2. Unsatisfactory results: missing shape details, inconsistency with the input

Limitations of previous works

SLIDE 5
  • Goal: precisely reconstruct the object from the given image
  • Idea: explicitly embed the 3D-2D projection geometry into the network
  • Approach: estimate a single-view visual hull inside the network

[Figure: multi-view visual hull vs. single-view visual hull]

Core Idea

SLIDE 6

[Pipeline diagram: input image → CNNs predict coarse shape, silhouette, and pose → single-view visual hull → final shape]

Method Overview

SLIDE 7

[Architecture diagram: 2D encoder → 3D decoder (shape), 2D encoder → 2D decoder (silhouette), 2D encoder → regressor → (R, T) (pose), and a 3D encoder → 3D decoder with a skip connection (+) for refinement]

  • V-Net: coarse shape prediction
  • P-Net: object pose and camera parameter estimation
  • S-Net: silhouette prediction
  • PSVH layer: visual hull generation
  • R-Net: coarse shape refinement

Components
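The PSVH layer's core operation, projecting each voxel into the image and reading off the silhouette probability, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the grid extent, the intrinsics convention, and the nearest-neighbor sampling (the actual layer would need differentiable, e.g. bilinear, sampling) are all assumptions.

```python
import numpy as np

def psvh_layer(silhouette, R, t, K, res=32, half_extent=0.5):
    """Probabilistic single-view visual hull (illustrative sketch).

    Each voxel center of a res^3 grid in a canonical cube is projected
    into the image with pose (R, t) and intrinsics K; its occupancy
    probability is read from the silhouette map. Voxels projecting
    outside the image (or behind the camera) get probability 0.
    """
    H, W = silhouette.shape
    # Voxel centers of a res^3 grid spanning [-half_extent, half_extent]^3.
    c = (np.arange(res) + 0.5) / res * 2 * half_extent - half_extent
    X, Y, Z = np.meshgrid(c, c, c, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)      # (res^3, 3)

    cam = pts @ R.T + t                                    # camera frame
    uvw = cam @ K.T                                        # homogeneous pixels
    z = uvw[:, 2]
    with np.errstate(divide="ignore", invalid="ignore"):
        u = np.round(uvw[:, 0] / z).astype(int)            # nearest pixel
        v = np.round(uvw[:, 1] / z).astype(int)

    # Voxels behind the camera or off-image keep probability 0.
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    probs = np.zeros(res ** 3)
    probs[inside] = silhouette[v[inside], u[inside]]
    return probs.reshape(res, res, res)
```

In the pipeline above this operation sits between S-Net/P-Net and R-Net, so the refinement network sees an occupancy prior that is geometrically consistent with the estimated pose.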

SLIDE 13
  • Overview:

Network Architecture

SLIDE 14

We use the binary cross-entropy loss to train V-Net, S-Net, and R-Net. Let q_o be the estimated probability at location o and q_o* the target probability; the loss is defined as

  L = −(1/|O|) Σ_{o∈O} ( q_o* log q_o + (1 − q_o*) log(1 − q_o) )   (2)

For P-Net, we use the ℓ1 regression loss:

  L = Σ_{j=1,2,3} β |ι_j − ι_j*| + Σ_{k=v,w} γ |u_k − u_k*| + δ |u_a − u_a*|   (3)

where we set β = 1, δ = 1, γ = 0.01.

Training Details

Loss:
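Equation (2) is straightforward to write down; a minimal NumPy sketch (the clipping epsilon is my addition to avoid log(0)):

```python
import numpy as np

def voxel_bce_loss(q, q_star, eps=1e-7):
    """Eq. (2): binary cross-entropy averaged over all locations O.

    q      -- predicted occupancy/silhouette probabilities
    q_star -- target probabilities (typically 0/1)
    """
    q = np.clip(q, eps, 1.0 - eps)                 # avoid log(0)
    return -np.mean(q_star * np.log(q) + (1.0 - q_star) * np.log(1.0 - q))
```

Equation (3) is then just a weighted sum of absolute errors over the pose and camera parameters, with weights β = 1, δ = 1, γ = 0.01.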

SLIDE 15

1. Train V-Net, S-Net, and P-Net independently.
2. Train R-Net with the coarse shape predicted by V-Net and the ground-truth visual hull.
3. Train the whole network end to end.

Training Details

Steps:

SLIDE 16
  • Network implemented in TensorFlow
  • Input image size: 128x128x3
  • Output voxel grid size: 32x32x32

Implementation Details

SLIDE 17
  • Object categories: car, airplane, chair, sofa
  • Datasets:
  • Rendered ShapeNet objects: ShapeNet, a dataset with a vast number of CAD models
  • Real images: PASCAL 3D+, images manually associated with a limited set of CAD models

Dataset

SLIDE 18
  • Results on the 3D-R2N2 dataset (rendered ShapeNet objects)
  • Ablation study:

Experiments

SLIDE 19
  • Results on the rendered ShapeNet objects

Experiments

SLIDE 20
  • Results on the rendered ShapeNet objects

Experiments

SLIDE 21

Experiments

  • Results on the synthetic dataset (rendered ShapeNet objects)
  • Ablation study:
SLIDE 22

Experiments

  • Comparison with MarrNet [Wu et al. 2017] on the synthetic dataset
SLIDE 23
  • Results on the PASCAL 3D+ dataset (real images)

Experiments

SLIDE 24
  • Results on the PASCAL 3D+ dataset (real images)

Experiments

[Example reconstructions with IoU 0.716, 0.793, and 0.937]
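For reference, the IoU metric used above compares binarized voxel grids; a minimal sketch (the 0.5 occupancy threshold is an assumption):

```python
import numpy as np

def voxel_iou(pred, gt, thresh=0.5):
    """Intersection-over-Union between a predicted occupancy grid
    (probabilities in [0, 1]) and a binary ground-truth grid."""
    p = pred >= thresh
    g = gt.astype(bool)
    union = np.logical_or(p, g).sum()
    if union == 0:                      # both grids empty: perfect match
        return 1.0
    return np.logical_and(p, g).sum() / union
```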

SLIDE 25
  • ~18 ms per image (55 fps!)
  • (Tested with a batch of 24 images on an NVIDIA Tesla M40 GPU)

Running Time

SLIDE 26
  • Embedding domain knowledge (3D-2D perspective geometry) into a DNN
  • Performing reconstruction jointly with segmentation and pose estimation
  • A novel, GPU-friendly PSVH (Probabilistic Single-view Visual Hull) layer

Contributions

SLIDE 27

Thanks for listening!

  • Feel free to ask any questions!
  • Email: hanqingwang@bit.edu.cn