SLIDE 1

Region Merging Driven by Deep Learning for RGB-D Segmentation and Labeling

  • U. Michieli, M. Camporese, A. Agiollo, G. Pagnutti, P. Zanuttigh

September 9th, 2019

ICDSC 2019

SLIDE 2

Outline

• Semantic Segmentation
• Proposed Framework
  • Pre-processing
  • Over-segmentation and Classification
  • Merging Phase
• Results
• Conclusions and Future Work

SLIDE 3

Semantic Segmentation

[Example scene with semantic labels: wall, floor, furniture, objects]

• Segmentation + labeling (pixel-wise classification)
• Deep learning and consumer depth sensors
• Very useful for free navigation systems to explore the surroundings

SLIDE 4

Semantic Segmentation


SLIDE 5

Semantic Segmentation


SLIDE 6

Proposed Framework


SLIDE 7

Proposed Framework


Use normalized cuts spectral clustering extended to RGB-D → but biased toward regions of similar size

[1] G. Pagnutti, L. Minto, P. Zanuttigh, "Segmentation and Semantic Labeling of RGBD Data with Convolutional Neural Networks and Surface Fitting", IET Computer Vision, 2017.

AIM: propose a CNN for region merging and to refine the boundaries of the shapes

Then a two-step procedure:
• Initial over-segmentation to properly separate the objects
• Region merging procedure to avoid over-segmentation

Framework derived from [1], but much faster and simpler.

SLIDE 8

Framework of [1]


[Pipeline diagram of [1]: pre-processing (depth and color data → (x, y, z) point set, normals computation, RGB-to-CIELab conversion; 320×240×6 → 160×120×6); over-segmentation and classification (normalized cuts spectral clustering with weights 1/σg, 1/σn, 1/σc on the geometry, orientation and color vectors; CNN producing segment descriptors); merge phase (compute similarity of adjacent segments, sort and discard pairs below a similarity threshold, NURBS fitting of the two segments and of their union, keep the union only if the surface-fitting accuracy improves)]

CONs:

• NURBS fitting is very slow
• Many hand-tuned thresholds (on depth, color, normals, NURBS fitting)

SLIDE 9

Proposed Framework


PROs:

  • Much faster
  • Fewer thresholds
  • Same accuracy
SLIDE 10

Proposed Framework - Preprocessing

• 3 channels for 3D location
• 3 channels for surface normals
• 3 channels for color representation → CIELab for perceptual uniformity
• Normalization to achieve a consistent representation across the 3 domains
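The 9-channel input can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the (x, y, z) points, normals and CIELab values are already available, and it standardizes each 3-channel domain to zero mean and unit variance so that no domain dominates (the slide only says "normalization for a consistent representation", so the exact scheme is an assumption here).

```python
import numpy as np

def build_nine_channel(xyz, normals, lab):
    """Stack 3D position, surface normals and CIELab color into a
    9-channel map, standardizing each 3-channel domain separately
    (assumed normalization scheme, for illustration only)."""
    chans = []
    for block in (xyz, normals, lab):
        block = block.astype(np.float64)
        mu = block.mean(axis=(0, 1), keepdims=True)
        sigma = block.std(axis=(0, 1), keepdims=True) + 1e-8
        chans.append((block - mu) / sigma)
    return np.concatenate(chans, axis=2)   # shape (H, W, 9)

# toy example with random stand-in data
H, W = 4, 4
rng = np.random.default_rng(0)
feat = build_nine_channel(rng.normal(size=(H, W, 3)) * 100,   # xyz in mm
                          rng.normal(size=(H, W, 3)),         # normals
                          rng.normal(size=(H, W, 3)) * 50)    # CIELab
```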

SLIDE 11

Proposed Framework – Over-segmentation


• Over-segmentation with normalized cuts spectral clustering with Nyström acceleration: 9D input
• CNN for the semantic labeling of each segment and for guiding the region merging process
  • 9 conv layers
  • 15 classes
  • very simple
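The idea of normalized cuts on the 9D features can be illustrated with a toy 2-way cut. The full method is a k-way spectral clustering with Nyström acceleration; this sketch instead computes the exact eigendecomposition of the normalized graph Laplacian on a tiny point set and splits on the sign of the Fiedler vector (all names and the Gaussian-affinity choice are illustrative assumptions).

```python
import numpy as np

def two_way_ncut(features, sigma=1.0):
    """Toy 2-way normalized-cut partition of points described by
    per-pixel feature vectors (e.g. the 9D position/normal/color input).
    Exact eigendecomposition; no Nystrom acceleration here."""
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian affinity
    deg = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt     # normalized Laplacian
    _, vecs = np.linalg.eigh(L_sym)                     # ascending eigenvalues
    fiedler = vecs[:, 1]                                # 2nd-smallest eigenvector
    return fiedler > 0                                  # binary partition
```

A k-way segmentation would keep the first k eigenvectors and cluster their rows instead of thresholding a single one.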

SLIDE 12

Proposed Framework – Region Merging

• Compute the adjacency map of the segments
• Compute the similarity between adjacent segment descriptors with the Bhattacharyya coefficient:

  c_{i,j} = Σ_u √( t_i(u) · t_j(u) )

  where u indexes the classes (class scores) and the t_i are the segment descriptors (~PDFs)
• Sort the list on the basis of c_{i,j}

SLIDE 13

Proposed Framework


Iterative merging procedure:

• Select segments with c_{i,j} > U_th
• A CNN classifier decides whether the two segments will be joined or not
  • If merged: a new segment for the union is created and the list is updated
  • If not merged: the segments are removed from the list
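The iterative procedure can be sketched as follows. This is a simplified stand-in: `classifier(i, j) -> bool` replaces the merging CNN, and the union's descriptor is approximated by averaging, whereas the real pipeline would recompute it from the union's pixels (both assumptions, for illustration only).

```python
import numpy as np

def bc(ti, tj):
    """Bhattacharyya similarity between two segment descriptors."""
    return float(np.sum(np.sqrt(ti * tj)))

def merge_regions(descriptors, adjacency, classifier, u_th=0.8):
    """Iterative merging: repeatedly take the most similar adjacent pair
    with c_{i,j} > u_th, let the classifier accept or reject the union,
    and update the candidate list accordingly."""
    segments = {k: {k} for k in descriptors}          # segment id -> member ids
    pairs = {frozenset(p) for p in adjacency}
    next_id = max(descriptors) + 1
    while True:
        # candidate pairs above the similarity threshold, best first
        cands = sorted(((bc(descriptors[i], descriptors[j]), i, j)
                        for i, j in map(tuple, pairs)), reverse=True)
        cands = [c for c in cands if c[0] > u_th]
        if not cands:
            return segments
        _, i, j = cands[0]
        pairs.discard(frozenset((i, j)))              # this pair is consumed
        if classifier(i, j):
            # merged: create the union segment and rewire the adjacency
            descriptors[next_id] = (descriptors[i] + descriptors[j]) / 2
            segments[next_id] = segments.pop(i) | segments.pop(j)
            del descriptors[i], descriptors[j]
            pairs = {frozenset(next_id if k in (i, j) else k for k in p)
                     for p in pairs}
            pairs = {p for p in pairs if len(p) == 2}
            next_id += 1
        # if not merged, the pair simply stays removed from the list
```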
SLIDE 14

CNN for Region Merging - PDFs

CNN for classification (6 conv. layers, symmetric padding, 2×2 max-pooling, ReLU)
input: the 2 outputs of the softmax layer of the semantic CNN (15 channels per candidate)
training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam with learning rate 10⁻⁴, regularization constant 10⁻⁵, U_th = 0.8
training time: about 11 hours on an NVIDIA Titan X GPU

[Architecture diagram: input 560×425×30 (560×425×6 for the normals variant) → conv blocks 4@9×9, 4@7×7, 4@5×5, 4@3×3, 4@3×3, each followed by ReLU and 2×2 max-pooling (feature maps 280×212×4 → 140×106×4 → 70×53×4 → 35×26×4 → 17×13×4) → CONV 2@17×13 → ARGMAX → 1×2 output: Merged / Not merged]
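The training objective (cross-entropy on the two-way merged/not-merged output plus L2 regularization) can be written out as a small sketch. The function name and the plain-numpy formulation are illustrative, not the authors' training code; the regularization constant 1e-5 is the value stated on this slide for the PDF variant.

```python
import numpy as np

def training_loss(logits, labels, weights, reg=1e-5):
    """Cross-entropy over the 1x2 merged/not-merged logits plus an L2
    penalty on the network weights.
    logits: (N, 2) array, labels: (N,) ints in {0, 1},
    weights: list of weight arrays."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(labels)), labels].mean()    # cross-entropy
    l2 = sum(float((w ** 2).sum()) for w in weights)      # L2 regularizer
    return ce + reg * l2
```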

SLIDE 15

CNN for classification (6 conv. layers, symmetric padding, 2×2 max-pooling, ReLU)
input: the surface normals of the 2 candidate segments (3 channels each)
training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam with learning rate 10⁻⁵, regularization constant 5·10⁻⁴, U_th = 0.75
training time: about 3 hours on an NVIDIA Titan X GPU

CNN for Region Merging - Normals


[Same architecture diagram as slide 14, with the surface normals (560×425×6) as input]

→ PDFs give richer descriptions, while normals are faster with limited impact on the final accuracy

SLIDE 16

Experimental Results


SLIDE 17

[Examples: RGB image, raw depth map, ground truth]

NYUDv2 Dataset [2]


1449 depth maps + color images of indoor scenes acquired with a Kinect sensor
• training set: 795 scenes
• test set: 654 scenes

894 classes clustered into 15 classes as in [3]; unknown and unlabeled classes are excluded

[2] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, "Indoor segmentation and support inference from RGBD images", ECCV, 2012.
[3] C. Couprie, C. Farabet, L. Najman, Y. LeCun, "Indoor semantic segmentation using depth information", ICLR, 2013.

SLIDE 18

Merging CNN – Ground Truth Generation


Need a dataset to train the merging CNN

• Randomly select 10 couples of adjacent segments in each image
• Assign label 1 if more than 85% of the union of the segments belongs to the same object in the semantic segmentation ground truth
• Assign label 0 otherwise
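The 85% labeling rule can be sketched directly on boolean segment masks (the function name `merge_label` and the mask-based interface are illustrative assumptions, not the authors' code):

```python
import numpy as np

def merge_label(gt, mask_a, mask_b, purity=0.85):
    """Ground-truth label for a candidate pair: 1 if more than `purity`
    of the pixels in the union of the two segments share the same object
    id in the semantic segmentation ground truth, else 0.
    gt: (H, W) integer label map; mask_a, mask_b: (H, W) boolean masks."""
    union = mask_a | mask_b
    ids, counts = np.unique(gt[union], return_counts=True)
    return int(counts.max() / counts.sum() > purity)
```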


[Figure: selection of a segment, selection of an adjacent segment, ground-truth examination; the region appears to be uniform → label 1]

SLIDE 19

Merging CNN – GT Ambiguities


• Examples of ambiguities in the ground truth:
  • Inconsistent labeling
  • Objects not labeled (missing)

[Class legend: Bed, Objects, Chair, Furniture, Ceiling, Floor, Picture/Deco, Sofa, Table, Wall, Windows, Books, Monitor/TV, Unknown]

SLIDE 20

Merging CNN – Results


[Examples: Predicted: Merge, GT: Merge; Predicted: Not Merged, GT: Not Merged]

• Good over-segmentation (inter-uniformity)

SLIDE 21

Merging CNN – Results


[Examples: Predicted: Not Merged, GT: Merge; Predicted: Merge, GT: Not Merged]

• Bad over-segmentation

SLIDE 22

Qualitative Results



[Qualitative comparison: color view, semantic CNN, Pagnutti et al. [1], our approach, ground truth; class legend as above]

SLIDE 23

Quantitative Results


[1] G. Pagnutti, L. Minto, P. Zanuttigh, "Segmentation and Semantic Labeling of RGBD Data with Convolutional Neural Networks and Surface Fitting", IET Computer Vision, 2017.
[4] C. Couprie, C. Farabet, L. Najman, Y. LeCun, "Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos", JMLR 15(1), 2014, 3489–3511.
[5] S. Hickson, I. Essa, H. Christensen, "Semantic Instance Labeling Leveraging Hierarchical Segmentation", WACV, 2015, 1068–1075.
[6] A. Wang, J. Lu, G. Wang, J. Cai, T. Cham, "Multi-modal unsupervised feature learning for RGB-D scene labeling", ECCV, 2014, 453–467.
[7] J. Wang, Z. Wang, D. Tao, S. See, G. Wang, "Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks", ECCV, 2016, 664–679.
[8] A. Hermans, G. Floros, B. Leibe, "Dense 3D semantic mapping of indoor scenes from RGB-D images", ICRA, 2014, 2631–2638.
[9] D. Eigen, R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture", ICCV, 2015, 2650–2658.

| Approach | Pixel Accuracy | Class Accuracy |
|---|---|---|
| Couprie et al. [4] | 52.4% | 36.2% |
| Hickson et al. [5] | 53.0% | 47.6% |
| A. Wang et al. [6] | 46.3% | 42.2% |
| J. Wang et al. [7] | 54.8% | 52.7% |
| A. Hermans et al. [8] | 54.2% | 48.0% |
| D. Eigen et al. [9] | 75.4% | 66.9% |
| Pagnutti et al. [1] | 67.2% | 54.4% |
| Semantic CNN | 64.4% | 51.7% |
| Our method (normals) | 66.6% | 53.6% |
| Our method (PDFs) | 67.2% | 54.5% |

Table 1: Average pixel and class accuracies on the test set of the NYUDv2 dataset.

SLIDE 24

Quantitative Results



* on an Intel Core i7-8700K CPU @ 3.70 GHz with an NVIDIA GeForce GTX 1070 GPU

• Same over-segmentation
• Similar results
• Much faster
  • no surface fitting
  • in [1] the time heavily depends on the area to be fitted, here it is constant!
• Fewer hand-tuned thresholds (1 vs. 4)

| Approach | Pixel Accuracy | Class Accuracy | Inference Time* |
|---|---|---|---|
| Pagnutti et al. [1] | 67.2% | 54.4% | 58 ms |
| Our method (normals) | 66.6% | 53.6% | 2 ms |
| Our method (PDFs) | 67.2% | 54.5% | 10 ms |

Average pixel and class accuracies and inference times on the test set of the NYUDv2 dataset.

SLIDE 25

Conclusions → Future Work

• Agnostic to the over-segmentation method → use other methods like superpixels
• Semantic CNN very simple → use a more complex one (trading off speed)
• CNN useful for region merging → focus the attention on the edges of the candidates
• Smaller computational time → useful for free navigation and for other fields

SLIDE 34

Thank you! Questions?