Deep Single-View 3D Object Reconstruction with Visual Hull Embedding



SLIDE 1

Deep Single-View 3D Object Reconstruction with Visual Hull Embedding

Hanqing Wang¹,², Jiaolong Yang², Wei Liang¹, Xin Tong²

¹Beijing Institute of Technology, Beijing, China
²Microsoft Research Asia, Beijing, China

AAAI 2019

SLIDE 2
  • Input: a single RGB(D) Image
  • Output: the corresponding 3D representation

Single-View 3D Reconstruction

SLIDE 3

Previous Works

  • Deep Learning based Methods:

[Girdhar ECCV’16] [Choy ECCV’16]

Other works: [Yan NIPS’16][Wu NIPS’16][Tulsiani CVPR’17][Zhu ICCV’17]

SLIDE 4
  • Problems of existing deep-learning-based methods:
  • 1. Arbitrary-view input images vs. canonical-view-aligned 3D shapes
  • 2. Unsatisfactory results: missing shape details, inconsistency with the input

Limitations of previous works

SLIDE 5
  • Goal: precisely reconstruct the object from the given image
  • Idea: explicitly embed the 3D-2D projection geometry into the network
  • Approach: estimate a single-view visual hull inside the network

[Figure: multi-view visual hull vs. single-view visual hull]

Core Idea

SLIDE 6

[Pipeline diagram: input image → CNNs predict coarse shape, silhouette, and pose → single-view visual hull → final shape]

Method Overview

SLIDE 7

[Architecture diagram: 2D encoder → 3D decoder (shape), 2D encoder → 2D decoder (silhouette), 2D encoder → regressor → (R, T) (pose), and a 3D encoder → 3D decoder with a skip connection (+) for refinement]

  • V-Net: coarse shape prediction
  • P-Net: object pose and camera parameter estimation
  • S-Net: silhouette prediction
  • PSVH layer: visual hull generation
  • R-Net: coarse shape refinement

Components
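The PSVH layer's core operation, projecting each voxel into the image and reading off the silhouette probability, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the grid extent, the intrinsics convention, and the nearest-neighbor sampling (the actual layer would need differentiable, e.g. bilinear, sampling) are all assumptions.

```python
import numpy as np

def psvh_layer(silhouette, R, t, K, res=32, half_extent=0.5):
    """Probabilistic single-view visual hull (illustrative sketch).

    Each voxel center of a res^3 grid in a canonical cube is projected
    into the image with pose (R, t) and intrinsics K; its occupancy
    probability is read from the silhouette map. Voxels projecting
    outside the image (or behind the camera) get probability 0.
    """
    H, W = silhouette.shape
    # Voxel centers of a res^3 grid spanning [-half_extent, half_extent]^3.
    c = (np.arange(res) + 0.5) / res * 2 * half_extent - half_extent
    X, Y, Z = np.meshgrid(c, c, c, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)      # (res^3, 3)

    cam = pts @ R.T + t                                    # camera frame
    uvw = cam @ K.T                                        # homogeneous pixels
    z = uvw[:, 2]
    with np.errstate(divide="ignore", invalid="ignore"):
        u = np.round(uvw[:, 0] / z).astype(int)            # nearest pixel
        v = np.round(uvw[:, 1] / z).astype(int)

    # Voxels behind the camera or off-image keep probability 0.
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    probs = np.zeros(res ** 3)
    probs[inside] = silhouette[v[inside], u[inside]]
    return probs.reshape(res, res, res)
```

In the pipeline above this operation sits between S-Net/P-Net and R-Net, so the refinement network sees an occupancy prior that is geometrically consistent with the estimated pose.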

SLIDE 13
  • Overview:

Network Architecture

SLIDE 14

We use the binary cross-entropy loss to train V-Net, S-Net, and R-Net. Let q_o be the estimated probability at location o and q_o* the target probability; the loss is defined as

  L = −(1/|O|) Σ_{o∈O} ( q_o* log q_o + (1 − q_o*) log(1 − q_o) )   (2)

For P-Net, we use the ℓ1 regression loss:

  L = Σ_{j=1,2,3} β |ι_j − ι_j*| + Σ_{k=v,w} γ |u_k − u_k*| + δ |u_a − u_a*|   (3)

where we set β = 1, δ = 1, γ = 0.01.

Training Details

Loss:
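Equation (2) is straightforward to write down; a minimal NumPy sketch (the clipping epsilon is my addition to avoid log(0)):

```python
import numpy as np

def voxel_bce_loss(q, q_star, eps=1e-7):
    """Eq. (2): binary cross-entropy averaged over all locations O.

    q      -- predicted occupancy/silhouette probabilities
    q_star -- target probabilities (typically 0/1)
    """
    q = np.clip(q, eps, 1.0 - eps)                 # avoid log(0)
    return -np.mean(q_star * np.log(q) + (1.0 - q_star) * np.log(1.0 - q))
```

Equation (3) is then just a weighted sum of absolute errors over the pose and camera parameters, with weights β = 1, δ = 1, γ = 0.01.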

SLIDE 15

1. Train V-Net, S-Net, and P-Net independently.
2. Train R-Net with the coarse shape predicted by V-Net and the ground-truth visual hull.
3. Train the whole network end to end.

Training Details

Steps:

SLIDE 16
  • Network implemented in TensorFlow
  • Input image size: 128x128x3
  • Output voxel grid size: 32x32x32

Implementation Details

SLIDE 17
  • Object categories: car, airplane, chair, sofa
  • Datasets:
  • Rendered ShapeNet objects: ShapeNet, a dataset with a vast number of CAD models
  • Real images: PASCAL 3D+, images manually associated with a limited set of CAD models

Dataset

SLIDE 18
  • Results on the 3D-R2N2 dataset (rendered ShapeNet objects)
  • Ablation study:

Experiments

SLIDE 19
  • Results on the rendered ShapeNet objects

Experiments

SLIDE 20
  • Results on the rendered ShapeNet objects

Experiments

SLIDE 21

Experiments

  • Results on the synthetic dataset (rendered ShapeNet objects)
  • Ablation study:
SLIDE 22

Experiments

  • Comparison with MarrNet [Wu et al. 2017] on the synthetic dataset
SLIDE 23
  • Results on the PASCAL 3D+ dataset (real images)

Experiments

SLIDE 24
  • Results on the PASCAL 3D+ dataset (real images)

Experiments

[Example reconstructions with IoU 0.716, 0.793, and 0.937]
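For reference, the IoU metric used above compares binarized voxel grids; a minimal sketch (the 0.5 occupancy threshold is an assumption):

```python
import numpy as np

def voxel_iou(pred, gt, thresh=0.5):
    """Intersection-over-Union between a predicted occupancy grid
    (probabilities in [0, 1]) and a binary ground-truth grid."""
    p = pred >= thresh
    g = gt.astype(bool)
    union = np.logical_or(p, g).sum()
    if union == 0:                      # both grids empty: perfect match
        return 1.0
    return np.logical_and(p, g).sum() / union
```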

SLIDE 25
  • ~18 ms per image (55 fps!)
  • (Tested with a batch of 24 images on an NVIDIA Tesla M40 GPU)

Running Time

SLIDE 26
  • Embedding domain knowledge (3D-2D perspective geometry) into a DNN
  • Performing reconstruction jointly with segmentation and pose estimation
  • A novel, GPU-friendly PSVH (Probabilistic Single-view Visual Hull) layer

Contributions

SLIDE 27

Thanks for listening!

  • Feel free to ask any questions!
  • Email: hanqingwang@bit.edu.cn