SLIDE 1

Towards Deep Multi-View Stereo

Silvano Galliani October 2, 2017

1 / 40

SLIDE 2

Multi-View Stereo

SLIDE 3

Outline

1 Gipuma: massively parallel multi-view stereo
2 Unsupervised normal prediction for improved multi-view reconstruction
3 Learned multi-patch similarity
4 Conclusions

SLIDE 4

Gipuma: massively parallel multi-view stereo

Gipuma: Geometry-based multi-view stereo reconstruction

  • S. Galliani, K. Lasinger, K. Schindler, ICCV 2015

SLIDE 5

1 Accurate multi-view stereo reconstruction
2 Highly efficient open-source GPU implementation: correspondence over ten 2 MPix images in 1.6 sec

SLIDE 6

Our approach:

1 Estimate depth and fit a patch per view by consecutively treating each view as the reference camera
2 Fuse depth maps in space to obtain the final reconstruction

SLIDE 7

Multi-view stereopsis

Approximate randomized search for the best depth & normal minimizing a local matching error:
  • Initialize all pixels with a random normal
  • Then:
  • Locally diffuse planes and save a plane when the cost decreases
  • Locally optimize the normal
  • Repeat (8 iterations are enough)
  • Similar to belief propagation
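The loop above can be sketched in a few lines. This is a minimal 1-D illustration of the scheme (random init → neighbor propagation → local refinement, repeated 8 times), not the GPU implementation: the photometric matching cost is replaced by a stand-in, the distance to a hypothetical ground-truth depth, so convergence is easy to inspect.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 64
true_depth = np.linspace(2.0, 5.0, W)          # hypothetical scene depths

def cost(x, d):
    """Stand-in for the local matching error of depth d at pixel x."""
    return abs(d - true_depth[x])

depth = rng.uniform(1.0, 6.0, W)               # random initialization

for it in range(8):                            # 8 iterations are enough
    for x in range(W):
        # Propagation: adopt a neighbor's plane when it lowers the cost
        for nx in (x - 1, x + 1):
            if 0 <= nx < W and cost(x, depth[nx]) < cost(x, depth[x]):
                depth[x] = depth[nx]
        # Refinement: random local perturbation, kept only if it helps
        cand = depth[x] + rng.normal(scale=0.5 / (it + 1))
        if cost(x, cand) < cost(x, depth[x]):
            depth[x] = cand

mean_err = np.mean(np.abs(depth - true_depth))
print(f"mean abs depth error after 8 iterations: {mean_err:.3f}")
```

With a real matching cost the structure is identical; only `cost` changes.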

SLIDE 8

Why it’s fast

1 Red-black diffusion of planes → maximum parallelization on the GPU
2 Candidates from a bigger neighborhood → faster convergence
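The red-black idea can be made concrete with a toy grid: pixels are split into a checkerboard, all "red" pixels are updated at once (they only read "black" neighbors), then vice versa, which maps directly onto GPU parallelism. The smoothing update here is a simple 4-neighbor average, a stand-in for the plane-diffusion step.

```python
import numpy as np

H, W = 8, 8
yy, xx = np.mgrid[0:H, 0:W]
red = (yy + xx) % 2 == 0                  # checkerboard mask
black = ~red

grid = np.zeros((H, W))
grid[0, :] = 1.0                          # fixed boundary values

def smooth(mask, g):
    # average of the 4-neighborhood, computed for all pixels at once
    padded = np.pad(g, 1, mode="edge")
    avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
           padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    out = g.copy()
    out[mask] = avg[mask]                 # update one color in parallel
    out[0, :] = 1.0                       # keep boundary fixed
    return out

for _ in range(50):
    grid = smooth(red, grid)              # half-step 1: all red pixels
    grid = smooth(black, grid)            # half-step 2: all black pixels

print(grid[1, 1])
```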

SLIDE 9

Depth map fusion

  • Fusion of depth & normal maps from different views into one 3D point cloud
  • Consistency check on depth (f_ε) and normal (f_ang) on at least f_con views
  • Average of reliable points (depth + normal)
  • Tunable adjustment between a more accurate or a more complete result by tuning f_ε, f_ang and f_con
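The consistency test above can be sketched for a single candidate point. Parameter names follow the slide (f_eps, f_ang, f_con); the per-view depths and normals are toy stand-ins for reprojected measurements, and the thresholds are assumed values for illustration.

```python
import numpy as np

def consistent(depths, normals, ref_depth, ref_normal,
               f_eps=0.01, f_ang=30.0, f_con=3):
    """Return (keep, fused_depth, fused_normal) for one candidate point."""
    depths = np.asarray(depths, dtype=float)
    normals = np.asarray(normals, dtype=float)
    # depth agreement: relative tolerance f_eps around the reference depth
    ok_depth = np.abs(depths - ref_depth) < f_eps * ref_depth
    # normal agreement: angle to the reference normal below f_ang degrees
    cosang = normals @ ref_normal
    ok_normal = cosang > np.cos(np.deg2rad(f_ang))
    ok = ok_depth & ok_normal
    if ok.sum() < f_con:                 # not enough consistent views
        return False, None, None
    fused_d = depths[ok].mean()          # average the reliable depths
    fused_n = normals[ok].mean(axis=0)
    fused_n /= np.linalg.norm(fused_n)   # renormalize the averaged normal
    return True, fused_d, fused_n

# Toy example: four views agree on depth/normal, one view is an outlier.
n = np.array([0.0, 0.0, 1.0])
keep, d, nn = consistent(
    [2.00, 2.01, 1.99, 2.00, 3.50],
    [n, n, n, n, [1.0, 0.0, 0.0]],
    ref_depth=2.0, ref_normal=n)
print(keep, round(d, 3))
```

Raising f_con or tightening f_eps/f_ang trades completeness for accuracy, as the slide notes.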

SLIDE 10

Results on old Middlebury Benchmark

SLIDE 11

Current MVS dataset – DTU

  • Large-scale multi-view dataset
  • 80 different objects, each covered by 49–64 images of resolution 1600 × 1200 pixels (≈2 million pixels)
  • ≈1.6 seconds per depth map with fast settings
  • ≈13 seconds per depth map with accurate settings

SLIDE 12

Results on DTU Dataset

Points
                 Acc. Mean  Acc. Med.  Comp. Mean  Comp. Med.
Ours                0.273      0.196       0.687       0.260
Ours comp           0.379      0.234       0.400       0.188
Ours fast           0.291      0.208       0.825       0.279
tola [Tol-10]       0.307      0.198       1.097       0.456
furu [Fur-10]       0.605      0.321       0.842       0.431
camp [Cam-08]       0.753      0.480       0.540       0.179

Surfaces
                 Acc. Mean  Acc. Med.  Comp. Mean  Comp. Med.
Ours                0.363      0.215       0.766       0.329
Ours comp           0.631      0.262       0.519       0.309
Ours fast           0.366      0.223       0.900       0.347
tola [Tol-10]       0.488      0.244       0.974       0.382
furu [Fur-10]       1.299      0.534       0.702       0.405
camp [Cam-08]       1.411      0.579       0.562       0.322

SLIDE 13

Figure: Ground truth, textured reconstruction, reconstructed triangulation

SLIDE 14

New dataset and online benchmark

New (multi-view) stereo and video benchmark on unstructured scenes:
  • SLR camera images
  • Multi-field-of-view stereo rig (video and images)
  • Training dataset available
  • Presented at CVPR 2017: eth3d.net

SLIDE 15

Unsupervised normal prediction for improved multi-view reconstruction

Just Look at the Image: Unsupervised normal prediction for improved multi-view reconstruction

  • S. Galliani, K. Schindler, CVPR2016

SLIDE 16

Multi-View Stereo: failure cases

Common failure modes for MVS are ambiguous matches:
  • Occlusions
  • Lack of texture in homogeneous regions

SLIDE 17

Just look at the image

Dichotomy:
  • Stereo correspondences: more accurate in textured regions with many large image gradients

SLIDE 18

Just look at the image

Dichotomy:
  • Stereo correspondences: more accurate in textured regions with many large image gradients
  • Shape-from-shading: typically more robust in flat regions with no albedo variations

SLIDE 19

Holy grail of Multi-View Stereo

Idea

Complement MVS with shading information

SLIDE 20

Explicit modeling of surface, light and material properties is an under-constrained problem: light positions, colors and intensities, and the reflectance function are all unknown.

Discriminative approach

SLIDE 21

Two observations:

1 Shading affects surface orientation, not depth
2 Specific light interactions can be view-dependent: we rule out viewpoint-based variations like specularities, occluding edges, etc.

We learn the relation between image and surface normal.
We train a single model per view.

SLIDE 22

Unsupervised

We start with a reliable MVS reconstruction from Gipuma. For every image we use it as training data to learn a CNN that predicts the surface normal from the RGB patch around each point.

SLIDE 23

Unsupervised online training for every image

We use a convolutional neural network that minimizes the error between training and predicted normals.

  • Accurate results w.r.t. the training data
  • Joint training of a single model across views did not work

Mean error: 18° predicted vs. 11° MVS
Mean of median error: 16° predicted vs. 9° MVS

SLIDE 24

SLIDE 25

Surface normal integration

Normals are dense but carry no depth information:

1 Integrate the new normals with a masked Poisson equation
2 Fuse all the new dense depth maps to obtain the final point cloud

SLIDE 26

Normal integration

The vector field g consists of the gradients of both functions:

    ∀x ∈ Ω:  g(x) = ∇f_mvs(x)  if x ∈ A,   ∇f_cnn(x)  otherwise.   (1)

Find an interpolant f over Ω\A that minimizes the squared error

    min_f ∫_{Ω\A} ‖∇f − g‖²                                        (2)

This leads to the Poisson equation

    ∆f = div g                                                     (3)

solved with Gauss–Seidel + Successive Over-Relaxation (SOR) (a few seconds).
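A Gauss–Seidel + SOR solve of a Poisson equation like this fits in a few lines. As a self-contained check, this sketch takes g as the forward-difference gradient of a known toy height field f_true and clamps the boundary (playing the role of the reliable region), so the recovered f can be compared to f_true; grid size, omega and iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

N = 24
y, x = np.mgrid[0:N, 0:N] / (N - 1.0)
f_true = x**2 + 0.5 * y                       # toy height field

# forward-difference gradient field g = (gx, gy)
gx = np.zeros((N, N)); gx[:, :-1] = f_true[:, 1:] - f_true[:, :-1]
gy = np.zeros((N, N)); gy[:-1, :] = f_true[1:, :] - f_true[:-1, :]

# divergence of g with matching backward differences
div = np.zeros((N, N))
div[:, 1:] += gx[:, 1:] - gx[:, :-1]
div[1:, :] += gy[1:, :] - gy[:-1, :]

# initialize f with the known (reliable) boundary, zero inside
f = np.zeros((N, N))
f[0, :], f[-1, :], f[:, 0], f[:, -1] = (f_true[0, :], f_true[-1, :],
                                        f_true[:, 0], f_true[:, -1])

omega = 1.8                                   # SOR relaxation factor
for _ in range(400):                          # Gauss-Seidel sweeps
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            # 5-point stencil solve of  Delta f = div g  at (i, j)
            new = 0.25 * (f[i - 1, j] + f[i + 1, j] +
                          f[i, j - 1] + f[i, j + 1] - div[i, j])
            f[i, j] += omega * (new - f[i, j])

err = np.max(np.abs(f - f_true))
print(f"max reconstruction error: {err:.4f}")
```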

SLIDE 27

Results

SLIDE 28

Results

SLIDE 29

Results

SLIDE 30

Learned multi-patch similarity

Learned Multi-Patch Similarity

  • W. Hartmann, S. Galliani, M. Havlena, L. Van Gool, K. Schindler, ICCV 2017

SLIDE 31

Learned Multi-Patch Similarity

A crucial component of stereo reconstruction is the matching function: similarity.
  • In 2-view stereo matching, similarity is uniquely determined: left vs. right.
  • But what about multi-view stereo? There is no direct solution → the common, robust choice is to average pairwise scores.

Idea: learn a similarity score across all the views.

SLIDE 32

Learned Multi-Patch Similarity

  • We train a CNN that directly learns a similarity score from multiple patches
  • Multi-branch Siamese network with shared weights and average aggregation
  • Cast as a binary classification problem

Per-branch layers (shared weights): conv1 → TanH1 → pool1 → conv2 → TanH2 → pool2. The branch outputs are averaged (mean), then fed through convolutional layers 3–5 with ReLU 3, ReLU 4 and a final softmax, producing a score in [0, 1].
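The key design choice, shared branch weights plus mean aggregation, can be illustrated with a tiny numerical sketch: the same branch maps every view's patch to a feature vector, the features are averaged, and a common head squashes the result to a score. Because aggregation is a mean over branches, the same weights apply to any number of views. The weights and patch vectors here are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
W_branch = rng.normal(size=(8, 16))      # shared branch weights
w_head = rng.normal(size=8)              # classification head weights

def branch(patch):
    return np.tanh(W_branch @ patch)     # one Siamese branch (shared)

def similarity(patches):
    feats = [branch(p) for p in patches] # same weights for every view
    pooled = np.mean(feats, axis=0)      # average aggregation over branches
    return 1.0 / (1.0 + np.exp(-w_head @ pooled))  # score in (0, 1)

patches5 = [rng.normal(size=16) for _ in range(5)]
s5 = similarity(patches5)                # 5 views
s3 = similarity(patches5[:3])            # 3 views -- same weights, no retraining
print(round(s5, 3), round(s3, 3))
```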

SLIDE 33

We don't learn SIFT → training data comes from ground truth.

We directly extract sets of patches obtained from 3D data points backprojected onto the images:
  • Positive examples are obtained by cropping a rectangle around the backprojected, corrected 3D depth in the other views
  • Negative examples are extracted from points far from the real depth but still on the epipolar lines
  • Roughly 15 million positive and negative examples are used

SLIDE 34

Application on Multi View Stereo

To compare our method directly, we modified a standard plane-sweeping algorithm to use our similarity score: for each point, to find the correct depth we test planes at different depth values and pick the one with the highest similarity.
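The sweep itself is simple: score every depth hypothesis with the similarity function and take the argmax. In this self-contained sketch the learned multi-patch score is replaced by a stand-in (negative variance across views, which is high when all views agree), and `get_patch` is a hypothetical helper that fetches a view's patch at a given depth.

```python
import numpy as np

def stand_in_similarity(patches):
    # high when all views' patches agree (low variance across views)
    return -float(np.var(np.stack(patches), axis=0).mean())

def plane_sweep(get_patch, n_views, depth_planes):
    """get_patch(view, depth) -> patch; returns the best-scoring depth."""
    scores = [stand_in_similarity([get_patch(v, d) for v in range(n_views)])
              for d in depth_planes]
    return depth_planes[int(np.argmax(scores))]

# Toy scene: the views' patches coincide only at the true depth 2.5.
true_depth = 2.5
def get_patch(view, depth):
    base = np.full(9, depth)
    return base + (depth - true_depth) * (view + 1)   # mismatch off-plane

planes = np.linspace(1.0, 4.0, 31)
best = plane_sweep(get_patch, n_views=4, depth_planes=planes)
print(best)
```

Dropping in the learned network for `stand_in_similarity` gives the modified plane sweep described above.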

SLIDE 35

Application on Multi View Stereo

Benefits of joint similarity computation:
  • Direct multiple-similarity computation across all the patches
  • The reference camera does not have a privileged role → robust to occlusions w.r.t. the reference view

SLIDE 36

Application on Multi View Stereo

Benefits of branch averaging:
Matching different numbers of viewpoints with the similarity network can be done without retraining:

Figure: Input, 3, 5, 9 views

SLIDE 37

Generalization to other datasets

The learned similarity generalizes to a different test environment:

Figure: Fountain from Strecha dataset

SLIDE 38

Quantitative Results

            Accuracy           Completeness
            Mean    Median     Mean    Median
BIRD
SAD         2.452   0.380      4.035   1.105
ZNCC        1.375   0.365      4.253   1.332
SIFT        1.594   0.415      5.269   1.845
LIFT        1.844   0.562      4.387   1.410
OUR concat  1.605   0.305      4.358   1.133
OUR         1.881   0.271      4.167   1.044
FLOWER
SAD         2.537   1.143      2.768   1.407
ZNCC        2.018   1.106      2.920   1.467
SIFT        2.795   1.183      4.747   2.480
LIFT        3.049   1.420      4.224   2.358
OUR concat  2.033   0.843      2.609   1.267
OUR         1.973   0.771      2.609   1.208
CAN
SAD         1.824   0.664      2.283   1.156
ZNCC        1.187   0.628      2.092   1.098
SIFT        1.769   0.874      3.067   1.726
LIFT        2.411   1.207      3.003   1.823
OUR concat  1.082   0.477      1.896   0.833
OUR         1.123   0.478      1.982   0.874
BUDDHA
SAD         0.849   0.250      1.119   0.561
ZNCC        0.688   0.299      1.208   0.656
SIFT        0.696   0.263      1.347   0.618
LIFT        0.688   0.299      1.208   0.656
OUR concat  0.682   0.231      1.017   0.473
OUR         0.637   0.206      1.057   0.475

SLIDE 39

Conclusions

  • Unsupervised normal estimation works to improve MVS
  • Dataset-specific models are better than generic ones when we are self-supervised
  • The similarity score can be trained jointly and proves better than hand-crafted features
  • A pure end-to-end multi-view stereo network is still not there
  • Deep learning applied to 3D reconstruction is an unsolved and open problem

SLIDE 40

Thanks for your attention

  • Gipuma code at http://github.com/kysucix/gipuma
  • ETH3D benchmark at http://eth3d.net
