Object Detection
JunYoung Gwak
Motivation
Image classification
○ Input: Image
○ Output: object class
Limitation of classification
○ Multiple classes
○ Location
i.e. Object classification assumes a single object class for the whole input image
We need high-level understanding of the complex world
Object Detection
Instance:
○ Treat each object as a separate instance, in contrast to considering them as a single semantic class
Bounding box:
○ (width, height, center x, center y)
○ (x1, y1, x2, y2)
○ (x1, y1, x2, y2, rotation)
Object class:
○ Similar to the object classification task, by predicting a vector of scores
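The first two bounding box parameterizations listed above carry the same information, so converting between them is a small arithmetic step. The sketch below is illustrative only; the function names are hypothetical and not taken from the slides.

```python
def center_to_corners(width, height, center_x, center_y):
    """Convert (width, height, center x, center y) to (x1, y1, x2, y2)."""
    half_w, half_h = width / 2.0, height / 2.0
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)


def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) to (width, height, center x, center y)."""
    return (x2 - x1, y2 - y1, (x1 + x2) / 2.0, (y1 + y2) / 2.0)


box = center_to_corners(4.0, 2.0, 10.0, 5.0)
print(box)                      # (8.0, 4.0, 12.0, 6.0)
print(corners_to_center(*box))  # (4.0, 2.0, 10.0, 5.0)
```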
Many works around 2014-2017 built the basis of modern object detection
○ R-CNN
○ Fast R-CNN
○ Faster R-CNN
○ SSD
○ YOLO (v2, v3)
○ FPN
○ Fully convolutional
○ ...
Let’s dissect the modern (2017) object detection pipeline
⇒ Detectron
Stage 1
○ For every anchor box
  ■ Predict bounding box offsets
  ■ Predict anchor confidence
(Optional, for two-stage networks) Stage 2
○ Predict bounding box offsets
○ Predict its semantic class
Fully Convolutional
Every pixel makes a prediction!
works in image classification
Key notions
○ Unpooling operation: recover the resolution of the input image
○ Pixel-wise fully connected layers
⇒ Every pixel predicts bounding boxes that are centered at its location
Anchor boxes
Neural networks prefer discrete prediction over continuous regression!
⇒ Preselect templates of bounding boxes to alleviate the regression problem
⇒ Let the neural network classify each anchor box and predict a small refinement of it
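As a rough illustration of preselected templates, the sketch below places a fixed set of anchor boxes at every location of a feature map. The stride, sizes, and aspect ratios are assumed values chosen for the example, and `generate_anchors` is a hypothetical helper, not a function from the slides or from Detectron.

```python
def generate_anchors(feat_h, feat_w, stride, sizes, aspect_ratios):
    """Return anchors as (x1, y1, x2, y2), one set per feature-map location."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in input-image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    # ratio is height / width; the area stays size * size.
                    w = size / ratio ** 0.5
                    h = size * ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(2, 3, stride=16, sizes=[32, 64],
                           aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 * 3 locations, 2 sizes, 3 ratios = 36 anchors
```

The network then only has to classify each of these templates and regress a small correction, instead of regressing box coordinates from scratch.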
Bounding box refinement
Given an anchor box,
predict a bounding box refinement toward the matching ground-truth box
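The slides do not spell out how the refinement is parameterized. One common choice, assumed here, is the R-CNN-style encoding: center offsets normalized by the reference box size, plus log-scale factors for width and height. Boxes are (center x, center y, width, height), and `encode` / `decode` are hypothetical names.

```python
import math


def encode(reference, target):
    """Offsets that move the reference box (cx, cy, w, h) onto the target box."""
    rx, ry, rw, rh = reference
    tx, ty, tw, th = target
    return ((tx - rx) / rw, (ty - ry) / rh,
            math.log(tw / rw), math.log(th / rh))


def decode(reference, offsets):
    """Apply predicted offsets to the reference box to get the refined box."""
    rx, ry, rw, rh = reference
    dx, dy, dw, dh = offsets
    return (rx + dx * rw, ry + dy * rh,
            rw * math.exp(dw), rh * math.exp(dh))


reference_box = (10.0, 10.0, 4.0, 4.0)
ground_truth = (12.0, 10.0, 8.0, 4.0)
offsets = encode(reference_box, ground_truth)  # regression target for training
refined = decode(reference_box, offsets)       # recovers the ground-truth box
```

At training time the network regresses `offsets`; at inference time `decode` turns its prediction back into a box.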
Bounding box classification
For each predicted bounding box,
○ Predict the anchor confidence, ex) binary cross-entropy loss
○ Predict semantic class of the instance, ex) categorical cross-entropy loss
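The two example losses can be written out for a single box as below. This is a minimal sketch assuming the confidence is already a probability and the class prediction is a vector of raw scores passed through a softmax; the function names are illustrative.

```python
import math


def binary_cross_entropy(prob, label):
    """Loss for the confidence: prob is in (0, 1), label is 0 or 1."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def softmax(scores):
    """Turn a vector of raw class scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(scores, class_index):
    """Loss for the semantic class: negative log-probability of the true class."""
    return -math.log(softmax(scores)[class_index])


print(binary_cross_entropy(0.9, 1))
print(categorical_cross_entropy([2.0, 0.5, -1.0], 0))
```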
Non-maximum suppression
The resulting prediction contains multiple predictions of the same instance.
Heuristic to remove redundant detections: visit predictions in descending order of confidence
○ If the current prediction heavily overlaps with any of the final predictions:
  ■ Discard it
○ Else
  ■ Add it to the final predictions
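The heuristic above can be sketched as a greedy loop. Two details are assumptions not stated on the slide: predictions are visited in descending order of confidence, and "heavily overlaps" means IoU above a threshold (0.5 here). Boxes are (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Discard the box if it heavily overlaps any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_maximum_suppression(boxes, scores))  # [0, 2]
```

The second box has IoU 81/119 with the first, so it is treated as a duplicate of the same instance and dropped.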
Two-stage networks
A second network refines the predictions of the first network
Pro
○ Better localization
○ Better precision
Con
For every region proposal from the first stage,
○ ROI Align: extract a fixed-size feature
Using the extracted features,
○ Predict bounding box offsets
○ Predict its semantic class
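To make the fixed-size feature extraction concrete, here is a heavily simplified ROI Align sketch. It assumes a single-channel feature map stored as a list of rows, a region given in feature-map coordinates, and one bilinear sample at the center of each output bin; practical implementations work over all channels and usually average several samples per bin. The function names are hypothetical.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size):
    """Extract an out_size x out_size feature from roi = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Sample at the bin center without rounding the coordinates.
            row.append(bilinear(feature,
                                y1 + (i + 0.5) * bin_h,
                                x1 + (j + 0.5) * bin_w))
        output.append(row)
    return output


# A feature map whose value equals its column index.
feature_map = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
pooled = roi_align(feature_map, (0.0, 0.0, 3.0, 3.0), out_size=2)
print(pooled)  # [[0.75, 2.25], [0.75, 2.25]]
```

Whatever the size of the region proposal, the output has the same shape, so the second-stage heads can be ordinary fixed-input layers.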
Bounding box refinement
Given a region proposal,
predict a bounding box refinement toward the matching ground-truth box
Feature Pyramid Networks
Key observation: deeper layers of the network have larger receptive fields
⇒ For ROIAlign, extract features for larger bounding boxes from deeper layers of the network
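The slide does not say how a box is assigned to a layer. The sketch below assumes the level-assignment rule from the FPN paper, k = floor(k0 + log2(sqrt(w * h) / 224)), with k0 = 4 and pyramid levels 2 to 5; treat these constants as assumptions, since other setups use different values.

```python
import math


def fpn_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the pyramid level to pool from for a box of the given size."""
    scale = math.sqrt(width * height) / canonical_size
    level = math.floor(k0 + math.log2(scale))
    # Clamp to the levels that actually exist in the pyramid.
    return int(min(max(level, k_min), k_max))


for size in (32, 112, 224, 448, 1000):
    print(size, fpn_level(size, size))
```

Small boxes map to shallow, high-resolution levels and large boxes map to deep levels with larger receptive fields.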
Given:
○ Single ground-truth bounding box
○ Single prediction bounding box
Output: How well are we doing?
⇒ Intersection over Union (IoU)
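IoU is the area of the intersection of the two boxes divided by the area of their union. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
```

The score is 1 for a perfect prediction and 0 when the boxes do not overlap at all.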
Given:
○ Multiple ground-truth bounding boxes
○ Multiple prediction bounding boxes
Output: How well are we doing?
Match: if all of the conditions are true
○ True positive: there exists a matching prediction
○ False negative: there is no matching prediction
○ False positive: there exists no matching ground truth
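The matching conditions did not survive in this transcript. The sketch below assumes the usual ones: a prediction matches a ground-truth box when both have the same class, their IoU reaches a threshold, and that ground-truth box has not already been claimed by a more confident prediction. The data layout and names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """predictions: (box, class, confidence); ground_truths: (box, class)."""
    matched = set()
    true_pos = false_pos = 0
    # More confident predictions get the first chance to claim a ground truth.
    for box, cls, _ in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_cls) in enumerate(ground_truths):
            if idx in matched or gt_cls != cls:
                continue
            overlap = iou(box, gt_box)
            if overlap >= iou_threshold and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is None:
            false_pos += 1  # no matching ground truth
        else:
            matched.add(best_idx)
            true_pos += 1
    false_neg = len(ground_truths) - len(matched)  # ground truths never matched
    return true_pos, false_pos, false_neg


gts = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "dog")]
preds = [((1, 1, 11, 11), "cat", 0.9), ((50, 50, 60, 60), "cat", 0.8)]
print(count_matches(preds, gts))  # (1, 1, 1)
```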
Average Precision (AP)
○ Sort the predictions in descending order of the prediction confidence
○ Compute Precision and Recall at every step
○ Draw the Precision/Recall plot; its area (integral) is the Average Precision (AP)
○ To reduce sensitivity to the score ordering, we sometimes take max precision to the right of the AP plot
○ Evaluate AP at multiple matching IoU thresholds and take the average of them to compute mAP
  ■ Average of (AP evaluated at matching IoU threshold 0.5, 0.55, 0.6, …, 0.95)
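The AP computation can be sketched as follows, assuming each prediction has already been marked as a true or false positive and the flags are sorted by descending confidence. The integral is approximated by summing precision times the recall increment at each step; the function name and input format are illustrative.

```python
def average_precision(is_true_positive, num_ground_truth, interpolate=True):
    """AP from true/false-positive flags sorted by descending confidence."""
    if num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    true_pos = 0
    for step, flag in enumerate(is_true_positive, start=1):
        true_pos += 1 if flag else 0
        precisions.append(true_pos / step)
        recalls.append(true_pos / num_ground_truth)
    if interpolate:
        # Replace each precision by the max precision to its right.
        for i in range(len(precisions) - 2, -1, -1):
            precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision/recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


# Matching IoU thresholds 0.5, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

ap = average_precision([True, False, True], num_ground_truth=2)
print(ap)
```

To get the averaged metric, the true/false-positive flags would be recomputed at each value in `iou_thresholds`, AP evaluated for each, and the ten results averaged.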
○ Lack of 3D pose, Occlusion information, and 3D location
Input
Output
(center location: x, y, z; bounding box size: w, h, l; rotation around gravity axis: θ)
The overall pipeline is not too different from that of 2D
Stage 1
○ For every anchor box
  ■ Predict bounding box offsets
  ■ Predict anchor confidence
(Optional, for two-stage networks) Stage 2
○ Predict bounding box offsets
○ Predict its semantic class
For example, Point R-CNN
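To make the 3D box parameterization (x, y, z, w, h, l, θ) concrete, the sketch below expands it into its eight corner points. The axis convention is an assumption made for illustration: the gravity axis is z, w runs along the box's local x axis, l along its local y axis, and h along z. Datasets differ on this, so check the convention before reusing it.

```python
import math


def box3d_corners(x, y, z, w, h, l, theta):
    """Eight corners of a 3D box rotated by theta around the z (gravity) axis."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                # Corner offset in the box's own frame, before rotation.
                dx, dy, dz = sx * w, sy * l, sz * h
                # Rotate in the ground plane, then translate to the center.
                corners.append((x + cos_t * dx - sin_t * dy,
                                y + sin_t * dx + cos_t * dy,
                                z + dz))
    return corners


corners = box3d_corners(0.0, 0.0, 0.0, 2.0, 4.0, 6.0, 0.0)
print(corners[0])  # (-1.0, -3.0, -2.0)
```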
Mask R-CNN
Stage 3
○ Predict instance mask
Mesh R-CNN
Stage 3
○ Predict 3D voxels and meshes