Zhao Chen, Machine Learning Intern, NVIDIA
JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS
ABOUT ME
- 5th year PhD student in physics @ Stanford by day, deep learning computer vision scientist by night.
- Intern with Deep Learning Applied Research (Autonomous Vehicles) @ NVIDIA, Oct-Dec 2016.
Zhao Chen, Joint Detection and Segmentation with Deep Hierarchical Networks, GTC 2017.
TALK OVERVIEW
(1) Problem statement and summary. (2) Dataset and preliminaries. (3) Model motivation. (4) Results and visualizations.
FROM SINGLE TO MULTITASK LEARNING
Putting deep learning to work in the real world
[Diagram: Detection Model -> Object Bounding Boxes; Segmentation Model -> Segmentation Mask]
Poor scalability + inefficient use of information!
FROM SINGLE TO MULTITASK LEARNING
How do we use one model to perform multiple tasks faster and better?
Putting deep learning to work in the real world
[Diagram: Shared Model -> Object Bounding Boxes, Segmentation Mask, + edge detection, + surface normals, + distance estimation…]
How do you relate various tasks to each other in a multi-task neural network?
WHAT WE WILL SHOW
- By ordering tasks based on receptive field and information density, we improve segmentation and detection accuracy by ~2% and ~8% over single networks, respectively.
- The joint network is robust and easy to tune compared to non-hierarchical baselines.
CITYSCAPES DATASET
- 2975 training images @ resolution 1024 x 2048.
- 20 classes for semantic segmentation, including 8 object classes. Of these 8, 4 are much more represented (car, bicycle, person, rider): the "easy classes."
- Segmentation, bounding box, and edge ground truth can all be generated.
[Figure panels: Raw Image, Edge Detection, Semantic Seg., Bounding Box]
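The talk does not say how the bounding box and edge ground truth are derived, so the following is only a minimal sketch under stated assumptions: it starts from a toy instance-ID map (a hypothetical input, not the actual Cityscapes file format), takes each instance's extent as its box, and marks a pixel as an edge when a neighbouring label differs. The helper names `bounding_boxes` and `edge_map` are made up for illustration, and the edge map here is binary, whereas the slides later mention 3 edge channels.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]

def bounding_boxes(instance_ids: Grid, background: int = 0) -> Dict[int, Tuple[int, int, int, int]]:
    """Return {instance_id: (x_min, y_min, x_max, y_max)} for each non-background id."""
    boxes: Dict[int, Tuple[int, int, int, int]] = {}
    for y, row in enumerate(instance_ids):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            if inst not in boxes:
                boxes[inst] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[inst]
                boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

def edge_map(labels: Grid) -> Grid:
    """Mark a pixel 1 if its right or lower neighbour carries a different label."""
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and labels[y][x + 1] != labels[y][x]
            down = y + 1 < h and labels[y + 1][x] != labels[y][x]
            edges[y][x] = 1 if (right or down) else 0
    return edges

# Toy 4 x 4 map: 0 is background, 1 and 2 are two object instances.
toy = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
]
boxes = bounding_boxes(toy)
edges = edge_map(toy)
```

Both outputs come from the same annotation, which is the sense in which several kinds of ground truth can be generated from one labeled dataset.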
HOW TO TRAIN A SEGMENTATION NETWORK
- Standard FCN (Shelhamer 2015) architecture: convolutions followed by a deconvolution to retrieve a pixel-dense prediction mask.
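To make the deconvolution step concrete, here is a small one-dimensional sketch of a transposed convolution, the operation usually meant by "deconvolution" in FCN-style networks. The kernel sizes, strides, and helper names (`deconv_output_size`, `transposed_conv1d`) are illustrative assumptions, not the settings used in the talk.

```python
from typing import List

def deconv_output_size(in_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel

def transposed_conv1d(x: List[float], kernel: List[float], stride: int) -> List[float]:
    """Each input value scatters a scaled copy of the kernel into the output."""
    out = [0.0] * deconv_output_size(len(x), len(kernel), stride)
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out

# A coarse 3-value prediction is upsampled to 6 values.
coarse = [1.0, 2.0, 3.0]
dense = transposed_conv1d(coarse, kernel=[1.0, 1.0], stride=2)

# Hypothetical sizes: a 32-wide map upsampled 32x back to a 1024-wide mask.
full_width = deconv_output_size(32, kernel=64, stride=32, padding=16)
```

With an all-ones kernel the operation reduces to nearest-neighbour upsampling; in a trained network the kernel weights are learned.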
HOW TO TRAIN A DETECTION NETWORK
- Network outputs confidence that a pixel lies near the center of an object.
- Points of high confidence produce bounding box coordinates.
- Confidences are rougher than full segmentation but robust to occlusion.
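One way to read the detection scheme above is as a two-map decode: keep the grid positions whose confidence clears a threshold, and read the box predicted at each of them. The sketch below assumes that reading; the threshold value, the `(x_min, y_min, x_max, y_max)` box layout, and the helper name `decode_boxes` are hypothetical, and any merging of overlapping candidates (which a real pipeline would likely need) is left out.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def decode_boxes(
    confidence: List[List[float]],
    box_coords: List[List[Box]],
    threshold: float = 0.5,
) -> List[Tuple[float, Box]]:
    """Return (score, box) pairs for positions with score >= threshold, best first."""
    detections = []
    for y, row in enumerate(confidence):
        for x, score in enumerate(row):
            if score >= threshold:
                detections.append((score, box_coords[y][x]))
    detections.sort(key=lambda d: d[0], reverse=True)
    return detections

# Toy 2 x 3 output grid.
empty: Box = (0.0, 0.0, 0.0, 0.0)
confidence = [
    [0.1, 0.9, 0.2],
    [0.0, 0.6, 0.1],
]
box_coords = [
    [empty, (10.0, 5.0, 50.0, 40.0), empty],
    [empty, (12.0, 6.0, 52.0, 41.0), empty],
]
detections = decode_boxes(confidence, box_coords)
```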
[Diagram labels: Input (1024 x 2048); Shared Feature Map (from base CNN); Deconv; Low-Res Seg Predictions (W x H x 20); Obj. Confidence Positions; Bbox Coordinate Positions]
L = α L_seg + (1 - α) L_det
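The weighted loss above can be written out directly. This is a minimal sketch: the per-task loss values are placeholder numbers, and `joint_loss` is a hypothetical helper name rather than code from the talk.

```python
def joint_loss(seg_loss: float, det_loss: float, alpha: float) -> float:
    """Combine task losses as alpha * L_seg + (1 - alpha) * L_det."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * seg_loss + (1.0 - alpha) * det_loss

# Placeholder per-task losses for one training step.
seg_loss, det_loss = 2.0, 4.0

# alpha = 1 trains on segmentation only, alpha = 0 on detection only.
sweep = {alpha: joint_loss(seg_loss, det_loss, alpha) for alpha in (0.0, 0.25, 0.5, 1.0)}
```

Sweeping α in this way trades attention between the two tasks during training.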
OUR BASELINE MODEL PERFORMANCE
[Plot not recoverable. Labels: Seg. Weight, Det. Weight, α]
(α controls how much attention we pay to segmentation vs. detection at training)
A LABEL HIERARCHY ALONG TWO AXES
[Diagram axes: Density of Information, Required Receptive Field. Tasks placed on the axes: Object Bounding Boxes, Edge Detection (plus), Semantic Segmentation, Object Confidence]
DEEP HIERARCHICAL NETWORK (DHM)
[Architecture diagram, built up over several slides]
Input (1024 x 2048)
Shared Feature Map (from base CNN)
Task branches, in order of decreasing information density:
- Edge Features, Deconv, Low-Res Edge Predictions (W x H x 3)
- Segmentation Features, Deconv, Low-Res Seg Predictions (W x H x 20)
- Obj. Confidence Features, Obj. Confidence Positions
- Obj. BBox Features, Dilated Convs, Bbox Coordinate Positions
Increasing receptive field: dilated convolutions are added on the bounding box branch.
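To illustrate why dilated convolutions are a natural way to increase receptive field, the sketch below computes the receptive field of a stack of stride-1 convolutions with and without dilation. The layer configurations are hypothetical examples, not the DHM's actual layers, and `receptive_field` is an illustrative helper name.

```python
from typing import List, Tuple

def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field of stacked stride-1 convolutions.

    layers holds (kernel_size, dilation) pairs; each layer widens the
    receptive field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# Three 3x3 convolutions, first without dilation, then with dilations 1, 2, 4.
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
```

The dilated stack covers roughly twice the extent with the same number of weights, which suits a task such as bounding box regression that has to see a whole object.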
RESULTS: HIGH ROBUSTNESS
[Figure panels: Raw Image, Edge Predictions, Segmentation Predictions, Bounding Box Predictions]
VISUALIZATIONS
[Figure panels, repeated over several example images. Labels: SINGLE NETWORK, DHM (OURS), DETECTION, SEGMENTATION; two slides show SALIENCY (CAR) and SALIENCY (BUS) in place of DETECTION]
SUMMARY
- Our two hierarchies within our model allow our network to reason about intra-task relationships:
  - Information density: (Seg +) Edge > Seg > Object Conf > Bbox
  - Receptive field: (Seg +) Edge = Bbox >> Object Conf > Seg
- With these relationships wired in, our network is:
  - More accurate
  - Robust to tuning
  - Simultaneously better at fine detail and more instance aware
  - Efficient and scalable (3 tasks, 1 network!)
REFERENCES
- J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
- S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, 2015.
- B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
- S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
- E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
- J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In https://arxiv.org/pdf/1512.04412.pdf, 2015.
THANK YOU!
Special thanks to:
- My internship mentor: Jian Yao
- My managers: John Zedlewski and Andrew Tao
- All the wonderful people in DLAR/DLAV.
Additional questions/comments: zchen89@stanford.edu