SLIDE 1

Region Merging Driven by Deep Learning for RGB-D Segmentation and Labeling

  • U. Michieli, M. Camporese, A. Agiollo, G. Pagnutti, P. Zanuttigh

September 9th, 2019

ICDSC 2019

SLIDE 2

Outline

• Semantic Segmentation
• Proposed Framework
  • Pre-processing
  • Over-segmentation and Classification
  • Merging Phase
• Results
• Conclusions and Future Work

SLIDE 3

Semantic Segmentation

[Example scene with semantic labels: wall, floor, furniture, objects]

• Segmentation + labeling (pixel-wise classification)
• Deep learning and consumer depth sensors
• Very useful for free navigation systems to explore the surroundings

SLIDE 4

Semantic Segmentation


SLIDE 5

Semantic Segmentation


SLIDE 6

Proposed Framework


SLIDE 7

Proposed Framework


Use normalized cuts spectral clustering extended to RGB-D → but biased toward regions of similar size

[1] G. Pagnutti, L. Minto, P. Zanuttigh, "Segmentation and Semantic Labeling of RGBD Data with Convolutional Neural Networks and Surface Fitting", IET Computer Vision, 2017.

AIM: propose a CNN for region merging and to refine the boundaries of the shapes

Then a two-step procedure:
• Initial over-segmentation to properly separate the objects
• Region merging procedure to avoid over-segmentation

Framework derived from [1], but much faster and simpler.

SLIDE 8

Framework of [1]


[Pipeline diagram of [1]: pre-processing (depth and color data → (x, y, z) point set, normals computation, RGB-to-CIELab conversion; 320×240×6 → 160×120×6); over-segmentation and classification (normalized cuts spectral clustering with weights 1/σg, 1/σn, 1/σc on the geometry, orientation and color vectors; CNN producing segment descriptors); merge phase (compute similarity of adjacent segments, sort and discard pairs below a similarity threshold, NURBS fitting of the two segments and of their union, keep the union only if the surface-fitting accuracy improves)]

CONs:

• NURBS fitting is very slow
• Many hand-tuned thresholds (on depth, color, normals, NURBS fitting)

SLIDE 9

Proposed Framework


PROs:

  • Much faster
  • Fewer thresholds
  • Same accuracy
SLIDE 10

Proposed Framework - Preprocessing

• 3 channels for 3D location
• 3 channels for surface normals
• 3 channels for color representation → CIELab for perceptual uniformity
• Normalization to achieve a consistent representation across the 3 domains
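The 9-channel input can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the (x, y, z) points, normals and CIELab values are already available, and it standardizes each 3-channel domain to zero mean and unit variance so that no domain dominates (the slide only says "normalization for a consistent representation", so the exact scheme is an assumption here).

```python
import numpy as np

def build_nine_channel(xyz, normals, lab):
    """Stack 3D position, surface normals and CIELab color into a
    9-channel map, standardizing each 3-channel domain separately
    (assumed normalization scheme, for illustration only)."""
    chans = []
    for block in (xyz, normals, lab):
        block = block.astype(np.float64)
        mu = block.mean(axis=(0, 1), keepdims=True)
        sigma = block.std(axis=(0, 1), keepdims=True) + 1e-8
        chans.append((block - mu) / sigma)
    return np.concatenate(chans, axis=2)   # shape (H, W, 9)

# toy example with random stand-in data
H, W = 4, 4
rng = np.random.default_rng(0)
feat = build_nine_channel(rng.normal(size=(H, W, 3)) * 100,   # xyz in mm
                          rng.normal(size=(H, W, 3)),         # normals
                          rng.normal(size=(H, W, 3)) * 50)    # CIELab
```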

SLIDE 11

Proposed Framework – Over-segmentation


• Over-segmentation with normalized cuts spectral clustering with Nyström acceleration: 9D input
• CNN for the semantic labeling of each segment and for guiding the region merging process
  • 9 conv layers
  • 15 classes
  • very simple
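The idea of normalized cuts on the 9D features can be illustrated with a toy 2-way cut. The full method is a k-way spectral clustering with Nyström acceleration; this sketch instead computes the exact eigendecomposition of the normalized graph Laplacian on a tiny point set and splits on the sign of the Fiedler vector (all names and the Gaussian-affinity choice are illustrative assumptions).

```python
import numpy as np

def two_way_ncut(features, sigma=1.0):
    """Toy 2-way normalized-cut partition of points described by
    per-pixel feature vectors (e.g. the 9D position/normal/color input).
    Exact eigendecomposition; no Nystrom acceleration here."""
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian affinity
    deg = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt     # normalized Laplacian
    _, vecs = np.linalg.eigh(L_sym)                     # ascending eigenvalues
    fiedler = vecs[:, 1]                                # 2nd-smallest eigenvector
    return fiedler > 0                                  # binary partition
```

A k-way segmentation would keep the first k eigenvectors and cluster their rows instead of thresholding a single one.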

SLIDE 12

Proposed Framework – Region Merging

• Compute the adjacency map of the segments
• Compute the similarity between adjacent segment descriptors with the Bhattacharyya coefficient:

  c_{i,j} = Σ_u √( t_i(u) · t_j(u) )

  where u indexes the classes (class scores) and the t_i are the segment descriptors (~PDFs)
• Sort the list on the basis of c_{i,j}

SLIDE 13

Proposed Framework


Iterative merging procedure:

• Select segments with c_{i,j} > U_th
• A CNN classifier decides whether the two segments will be joined or not
  • If merged: a new segment for the union is created and the list is updated
  • If not merged: the segments are removed from the list
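The iterative procedure can be sketched as follows. This is a simplified stand-in: `classifier(i, j) -> bool` replaces the merging CNN, and the union's descriptor is approximated by averaging, whereas the real pipeline would recompute it from the union's pixels (both assumptions, for illustration only).

```python
import numpy as np

def bc(ti, tj):
    """Bhattacharyya similarity between two segment descriptors."""
    return float(np.sum(np.sqrt(ti * tj)))

def merge_regions(descriptors, adjacency, classifier, u_th=0.8):
    """Iterative merging: repeatedly take the most similar adjacent pair
    with c_{i,j} > u_th, let the classifier accept or reject the union,
    and update the candidate list accordingly."""
    segments = {k: {k} for k in descriptors}          # segment id -> member ids
    pairs = {frozenset(p) for p in adjacency}
    next_id = max(descriptors) + 1
    while True:
        # candidate pairs above the similarity threshold, best first
        cands = sorted(((bc(descriptors[i], descriptors[j]), i, j)
                        for i, j in map(tuple, pairs)), reverse=True)
        cands = [c for c in cands if c[0] > u_th]
        if not cands:
            return segments
        _, i, j = cands[0]
        pairs.discard(frozenset((i, j)))              # this pair is consumed
        if classifier(i, j):
            # merged: create the union segment and rewire the adjacency
            descriptors[next_id] = (descriptors[i] + descriptors[j]) / 2
            segments[next_id] = segments.pop(i) | segments.pop(j)
            del descriptors[i], descriptors[j]
            pairs = {frozenset(next_id if k in (i, j) else k for k in p)
                     for p in pairs}
            pairs = {p for p in pairs if len(p) == 2}
            next_id += 1
        # if not merged, the pair simply stays removed from the list
```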
SLIDE 14

CNN for Region Merging - PDFs

CNN for classification (6 conv. layers, symmetric padding, 2×2 max-pooling, ReLU)
input: the 2 outputs of the softmax layer of the semantic CNN (15 channels per candidate)
training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam with learning rate 10⁻⁴, regularization constant 10⁻⁵, U_th = 0.8
training time: about 11 hours on an NVIDIA Titan X GPU

[Architecture diagram: input 560×425×30 (560×425×6 for the normals variant) → conv blocks 4@9×9, 4@7×7, 4@5×5, 4@3×3, 4@3×3, each followed by ReLU and 2×2 max-pooling (feature maps 280×212×4 → 140×106×4 → 70×53×4 → 35×26×4 → 17×13×4) → CONV 2@17×13 → ARGMAX → 1×2 output: Merged / Not merged]
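The training objective (cross-entropy on the two-way merged/not-merged output plus L2 regularization) can be written out as a small sketch. The function name and the plain-numpy formulation are illustrative, not the authors' training code; the regularization constant 1e-5 is the value stated on this slide for the PDF variant.

```python
import numpy as np

def training_loss(logits, labels, weights, reg=1e-5):
    """Cross-entropy over the 1x2 merged/not-merged logits plus an L2
    penalty on the network weights.
    logits: (N, 2) array, labels: (N,) ints in {0, 1},
    weights: list of weight arrays."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(labels)), labels].mean()    # cross-entropy
    l2 = sum(float((w ** 2).sum()) for w in weights)      # L2 regularizer
    return ce + reg * l2
```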

SLIDE 15

CNN for classification (6 conv. layers, symmetric padding, 2×2 max-pooling, ReLU)
input: the surface normals of the 2 candidate segments (3 channels each)
training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam with learning rate 10⁻⁵, regularization constant 5·10⁻⁴, U_th = 0.75
training time: about 3 hours on an NVIDIA Titan X GPU

CNN for Region Merging - Normals


[Same architecture diagram as slide 14, with the surface normals (560×425×6) as input]

→ PDFs give richer descriptions, while normals are faster with limited impact on the final accuracy

SLIDE 16

Experimental Results


SLIDE 17

[Examples: RGB image, raw depth map, ground truth]

NYUDv2 Dataset [2]


1449 depth maps + color images of indoor scenes acquired with a Kinect sensor
• training set: 795 scenes
• test set: 654 scenes

894 classes clustered into 15 classes as in [3]; unknown and unlabeled classes are excluded

[2] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, "Indoor segmentation and support inference from RGBD images", ECCV, 2012.
[3] C. Couprie, C. Farabet, L. Najman, Y. LeCun, "Indoor semantic segmentation using depth information", ICLR, 2013.

SLIDE 18

Merging CNN – Ground Truth Generation


Need a dataset to train the merging CNN

• Randomly select 10 couples of adjacent segments in each image
• Assign label 1 if more than 85% of the union of the segments belongs to the same object in the semantic segmentation ground truth
• Assign label 0 otherwise
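The 85% labeling rule can be sketched directly on boolean segment masks (the function name `merge_label` and the mask-based interface are illustrative assumptions, not the authors' code):

```python
import numpy as np

def merge_label(gt, mask_a, mask_b, purity=0.85):
    """Ground-truth label for a candidate pair: 1 if more than `purity`
    of the pixels in the union of the two segments share the same object
    id in the semantic segmentation ground truth, else 0.
    gt: (H, W) integer label map; mask_a, mask_b: (H, W) boolean masks."""
    union = mask_a | mask_b
    ids, counts = np.unique(gt[union], return_counts=True)
    return int(counts.max() / counts.sum() > purity)
```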


[Figure: selection of a segment, selection of an adjacent segment, ground-truth examination; the region appears to be uniform → label 1]

SLIDE 19

Merging CNN – GT Ambiguities


• Examples of ambiguities in the ground truth:
  • Inconsistent labeling
  • Objects not labeled (missing)

[Class legend: Bed, Objects, Chair, Furniture, Ceiling, Floor, Picture/Deco, Sofa, Table, Wall, Windows, Books, Monitor/TV, Unknown]

SLIDE 20

Merging CNN – Results


[Examples: Predicted: Merge, GT: Merge; Predicted: Not Merged, GT: Not Merged]

• Good over-segmentation (inter-uniformity)

SLIDE 21

Merging CNN – Results


[Examples: Predicted: Not Merged, GT: Merge; Predicted: Merge, GT: Not Merged]

• Bad over-segmentation

SLIDE 22

Qualitative Results



[Qualitative comparison: color view, semantic CNN, Pagnutti et al. [1], our approach, ground truth; class legend as above]

SLIDE 23

Quantitative Results


[1] G. Pagnutti, L. Minto, P. Zanuttigh, "Segmentation and Semantic Labeling of RGBD Data with Convolutional Neural Networks and Surface Fitting", IET Computer Vision, 2017.
[4] C. Couprie, C. Farabet, L. Najman, Y. LeCun, "Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos", JMLR 15(1), 2014, 3489–3511.
[5] S. Hickson, I. Essa, H. Christensen, "Semantic Instance Labeling Leveraging Hierarchical Segmentation", WACV, 2015, 1068–1075.
[6] A. Wang, J. Lu, G. Wang, J. Cai, T. Cham, "Multi-modal unsupervised feature learning for RGB-D scene labeling", ECCV, 2014, 453–467.
[7] J. Wang, Z. Wang, D. Tao, S. See, G. Wang, "Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks", ECCV, 2016, 664–679.
[8] A. Hermans, G. Floros, B. Leibe, "Dense 3D semantic mapping of indoor scenes from RGB-D images", ICRA, 2014, 2631–2638.
[9] D. Eigen, R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture", ICCV, 2015, 2650–2658.

| Approach | Pixel Accuracy | Class Accuracy |
|---|---|---|
| Couprie et al. [4] | 52.4% | 36.2% |
| Hickson et al. [5] | 53.0% | 47.6% |
| A. Wang et al. [6] | 46.3% | 42.2% |
| J. Wang et al. [7] | 54.8% | 52.7% |
| A. Hermans et al. [8] | 54.2% | 48.0% |
| D. Eigen et al. [9] | 75.4% | 66.9% |
| Pagnutti et al. [1] | 67.2% | 54.4% |
| Semantic CNN | 64.4% | 51.7% |
| Our method (normals) | 66.6% | 53.6% |
| Our method (PDFs) | 67.2% | 54.5% |

Table 1: Average pixel and class accuracies on the test set of the NYUDv2 dataset.

SLIDE 24

Quantitative Results



* on an Intel Core i7-8700K CPU @ 3.70 GHz with an NVIDIA GeForce GTX 1070 GPU

• Same over-segmentation
• Similar results
• Much faster
  • no surface fitting
  • in [1] the time heavily depends on the area to be fitted, here it is constant!
• Fewer hand-tuned thresholds (1 vs. 4)

| Approach | Pixel Accuracy | Class Accuracy | Inference Time* |
|---|---|---|---|
| Pagnutti et al. [1] | 67.2% | 54.4% | 58 ms |
| Our method (normals) | 66.6% | 53.6% | 2 ms |
| Our method (PDFs) | 67.2% | 54.5% | 10 ms |

Average pixel and class accuracies and inference times on the test set of the NYUDv2 dataset.

SLIDE 25

Conclusions → Future Work

• Agnostic to the over-segmentation method → use other methods like superpixels
• Semantic CNN very simple → use a more complex one (trading off speed)
• CNN useful for region merging → focus the attention on the edges of the candidates
• Smaller computational time → useful for free navigation and for other fields

SLIDE 34

Thank you! Questions?