SLIDE 1

3D Multi-Object Tracking for Autonomous Driving

Xinshuo Weng, Kris Kitani

June 15, 2020

SLIDE 2


3D multi-object tracking is an important perception task for autonomous driving

SLIDE 3

Standard 3D MOT Pipeline

Pipeline: Sensor Data → 3D Object Detection → Data Association → Evaluation

SLIDE 4

Standard 3D MOT Pipeline

Pipeline: Sensor Data (LiDAR, RGB) → 3D Object Detection → Data Association → Evaluation

SLIDE 5

Standard 3D MOT Pipeline

Pipeline: Sensor Data → 3D Object Detection (detection results) → Data Association → Evaluation

SLIDE 6

Standard 3D MOT Pipeline

Pipeline: Sensor Data → 3D Object Detection → Data Association (tracking results) → Evaluation

SLIDE 7

Standard 3D MOT Pipeline

Pipeline: Sensor Data → 3D Object Detection → Data Association → Evaluation (also important!)

Evaluation:

  • MOTA: MOT accuracy
  • MOTP: MOT precision
  • IDS: # of identity switches
  • FRAG: # of trajectory fragments
  • ……
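For concreteness, MOTA combines the error counts above into a single number. This is the standard CLEAR-MOT definition, sketched here for illustration (not KITTI's actual evaluation code):

```python
def mota(num_fn, num_fp, num_ids, num_gt):
    """Multi-Object Tracking Accuracy: 1 - (FN + FP + IDS) / GT, where GT is
    the total number of ground-truth objects summed over all frames."""
    return 1.0 - (num_fn + num_fp + num_ids) / num_gt

# Example: 10 missed objects, 5 false positives, 2 identity switches, 100 GT boxes
print(mota(10, 5, 2, 100))  # ≈ 0.83
```

Note that IDS and FRAG are counted during matching; the function above only aggregates the totals.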
SLIDE 8

Standard 3D MOT Pipeline

Pipeline: Sensor Data → 3D Object Detection → Data Association → Evaluation

SLIDE 9

What is the state of the art?

SLIDE 10

State of the Art (3D MOT)

Better models from better (bigger) data!

[Chart: dataset sizes, a 150x increase! *Mined trajectory data not counted for the Argo dataset]

SLIDE 11

State of the Art (3D MOT)

[Chart: Monocular 3D Detection AP on KITTI, a 15x increase in 3 years]

Image credit to Patrick Langechuan Liu, https://towardsdatascience.com/monocular-3d-object-detection-in-autonomous-driving-2476a3c7f57e

SLIDE 12

State of the Art (3D MOT)

[Chart: LiDAR-based 3D Detection on KITTI, a 27% increase in 2 years]

SLIDE 13

State of the Art (3D MOT)

[Chart: 2D MOT on KITTI, an 18% increase in 4 years. *3D methods compared using 2D evaluation on KITTI]

SLIDE 14

State of the Art (3D MOT)

Recent trend: feature extraction and optimization are jointly optimized.

  • D. Frossard, R. Urtasun. End-to-End Learning of Multi-Sensor 3D Tracking by Detection. ICRA 2018.
  • Zhang et al. Robust Multi-Modality Multi-Object Tracking. ICCV 2019.

SLIDE 15

State of the Art (3D MOT)

What are open problems in 3D MOT?

SLIDE 16

Some Open Problems (3D MOT)

  • Many large-scale datasets, but sensor suites and annotations are not unified
  • 3D detection performance is improving but doesn't take sensor physics into account
  • The context of the multi-level optimization problem (sensors, forecasting, control) is not taken into account
  • The representation doesn't take the context of other objects and the scene into account
  • Weak 3D MOT evaluation datasets and metrics
  • Sensor optimization and redundancy should also be taken into account
  • Detection and tracking should be coupled more tightly

SLIDE 17

Some Open Problems (3D MOT)

  • Many large-scale datasets, but sensor suites and annotations are not unified
  • 3D detection performance is improving but doesn't take sensor physics into account
  • The context of the multi-level optimization problem (sensors, forecasting, control) is not taken into account
  • The representation doesn't take the context of other objects and the scene into account (this talk)
  • Weak 3D MOT evaluation datasets and metrics (this talk)
  • Sensor optimization and redundancy should also be taken into account
  • Detection and tracking should be coupled more tightly

SLIDE 18

Recent Work on Evaluation

SLIDE 19

What are the Issues of Evaluation?

  • IoU (intersection over union)
  • For the pioneering 3D MOT dataset KITTI, evaluation is done in 2D
  • IoU is computed on the 2D image plane (not in 3D)
  • The common practice for evaluating 3D MOT methods is:
  • First project the 3D trajectories onto the image plane
  • Then run the 2D evaluation code provided by KITTI

IoU in 2D space vs. IoU in 3D space (image credit to Xu et al: 3D-GIoU). Bp: the predicted box; Bg: the ground-truth box; Bc: the smallest enclosing box; I2D, I3D: the intersection.
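As a simplified illustration of the gap between the two metrics, here is an axis-aligned IoU sketch. The simplifications are mine: real KITTI boxes have a heading angle, and true image projection is perspective rather than just dropping the depth axis.

```python
import numpy as np

def iou_axis_aligned(lo_a, hi_a, lo_b, hi_b):
    """IoU of two axis-aligned boxes given by min/max corner arrays.
    Works in any dimension (2D image boxes or 3D boxes; heading ignored)."""
    lo = np.maximum(lo_a, lo_b)
    hi = np.minimum(hi_a, hi_b)
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(hi_a - lo_a) + np.prod(hi_b - lo_b) - inter
    return inter / union

# Schematic boxes with coordinates (x, y, depth): the prediction has the same
# image footprint (x, y) as the ground truth but the wrong extent along depth.
gt_lo,   gt_hi   = np.array([0.0, 0.0, 10.0]), np.array([2.0, 2.0, 14.0])
pred_lo, pred_hi = np.array([0.0, 0.0, 10.0]), np.array([2.0, 2.0, 12.0])

iou_3d = iou_axis_aligned(gt_lo, gt_hi, pred_lo, pred_hi)                  # 0.5
iou_2d = iou_axis_aligned(gt_lo[:2], gt_hi[:2], pred_lo[:2], pred_hi[:2])  # 1.0
print(iou_3d, iou_2d)
```

The 3D IoU penalizes the wrong depth extent (0.5), while the 2D projection looks perfect (1.0), which is exactly the failure mode the next slide describes.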

SLIDE 20

What are the Issues of Evaluation?

  • Why is it not good to evaluate 3D MOT methods in 2D space?
  • It cannot demonstrate the strength of 3D MOT methods
  • It throws away the extra information (e.g., depth value, length of the object, heading orientation)
  • It cannot fairly compare 3D MOT methods. Why?
  • A method is not penalized for wrong predicted depth, length, or heading as long as the 2D projection is good
  • Which predicted box is better, blue or green?
  • Conclusion: 2D metrics should not be used to evaluate 3D MOT methods

Blue: predicted box 1; Green: predicted box 2; Red: the ground-truth box

SLIDE 21

Our Solution: Upgrade the Metrics Using 3D IoU

  • X. Weng, K. Kitani. A Baseline for 3D Multi-Object Tracking. arXiv 2019.
  • Replace the 2D IoU in the KITTI evaluation code with 3D IoU
  • https://github.com/xinshuoweng/AB3DMOT (~800 stars)
  • Work with nuTonomy collaborators to adopt our 3D metrics in the nuScenes evaluation
  • https://www.nuscenes.org/

Our released new evaluation code; nuScenes 3D MOT evaluation with our metrics
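The AB3DMOT baseline that ships with this evaluation code pairs a constant-velocity 3D Kalman filter with IoU-based Hungarian matching. A minimal constant-velocity predict step, as my own sketch rather than the repo's implementation:

```python
import numpy as np

# State: [x, y, z, vx, vy, vz] — constant-velocity motion over one frame (dt = 1).
F = np.eye(6)
F[:3, 3:] = np.eye(3)  # position += velocity * dt

def predict(state):
    """Predict a track's state at the next frame under constant velocity."""
    return F @ state

track = np.array([10.0, 2.0, 0.0, 1.0, 0.0, 0.0])  # moving +1 m/frame along x
print(predict(track)[:3])  # → [11.  2.  0.]
```

In the full filter this prediction is then corrected by the matched detection via the Kalman update; the sketch shows only the motion model.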

SLIDE 22

What are the Issues of Evaluation?

  • Are we done with evaluation? Can we further improve the current metrics?
  • E.g., MOTA (multi-object tracking accuracy) measures performance at a single recall point
  • Common practice:
  • Select a confidence threshold, e.g., 0.9
  • Filter out detections with lower confidence
  • Perform data association on the remaining detections

[Figure: MOTA over Recall curve]

SLIDE 23

What are the Issues of Evaluation?

  • Why is it not good to evaluate at a single recall point?
  • Consequences:
  • The confidence threshold needs to be carefully tuned, a non-trivial effort
  • We cannot understand the full spectrum of accuracy and precision of a MOT system
  • Which MOT system is better, blue or orange?
  • The orange one has higher MOTA at its best recall point (r = 0.9)
  • The blue one has higher MOTA overall, at many recall points
  • Ideally, we want high performance at all recall points

[Figure: MOTA over Recall curves for 3D MOT system 1 and 3D MOT system 2, recall 0.1 to 1.0]

SLIDE 24

Our Solution: Integral Metrics

  • MOTA does not take the detection confidence into account
  • What do we do to improve the evaluation?
  • Compute integral metrics as the area under the MOTA-over-recall curve, e.g., average MOTA (AMOTA)
  • Analogous to average precision (AP) in object detection
  • This models the full spectrum of MOT accuracy
  • X. Weng, K. Kitani. A Baseline for 3D Multi-Object Tracking. arXiv 2019.
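The integral can be approximated by averaging MOTA over a fixed grid of recall thresholds. This is a simplified sketch; the official AMOTA (e.g., in the nuScenes evaluation) additionally rescales the per-recall MOTA values:

```python
def amota(motas_at_recalls):
    """Average MOTA over L evenly spaced recall thresholds (e.g., 0.1 ... 1.0),
    approximating the area under the MOTA-over-recall curve."""
    return sum(motas_at_recalls) / len(motas_at_recalls)

# Illustrative numbers for the two systems on the previous slide:
# orange peaks at one recall point, blue is higher across most of the curve.
orange = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.8, 0.0]
blue   = [0.4, 0.5, 0.5, 0.6, 0.6, 0.6, 0.5, 0.5, 0.4, 0.1]
print(amota(orange), amota(blue))  # blue wins despite its lower peak
```

This mirrors how AP summarizes a precision-recall curve for detectors: one number for the whole operating range, so no threshold tuning is needed.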

SLIDE 25

Recent Work on Improving Feature Learning for 3D MOT

SLIDE 26

What are the Issues of Feature Learning?

  • X. Weng, Y. Wang, Y. Man, K. Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR 2020.
  • Goal: learn discriminative features for different objects
  • Issues in feature learning:
  • Feature extraction for each object is independent of the other objects
  • Why is this not good? There is no communication between objects, so the context information is ignored
  • Features come from only one or two modalities
  • E.g., 2D appearance, or 2D motion, or 3D motion, or 3D appearance
  • Why is this not good? It does not utilize all of the complementary information

Pipeline from prior work: objects in frame t and frame t+1 → 2D (or 3D) feature extractor → affinity matrix → Hungarian algorithm
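The affinity-matrix-plus-Hungarian step shared by these pipelines can be sketched directly with SciPy (toy affinity values for illustration; in practice the matrix comes from learned features or IoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Affinity between 2 tracks in frame t and 3 detections in frame t+1.
affinity = np.array([
    [0.9, 0.1, 0.0],   # track 0 matches detection 0 strongly
    [0.2, 0.8, 0.1],   # track 1 matches detection 1
])

# The Hungarian algorithm minimizes total cost, so negate the affinity.
rows, cols = linear_sum_assignment(-affinity)
matches = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(matches)  # → [(0, 0), (1, 1)]
```

Detection 2 is left unmatched here and would typically start a new track.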

SLIDE 27

Improve Feature Learning for 3D MOT

  • X. Weng, Y. Wang, Y. Man, K. Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR 2020.
  • How can we address these two issues?
  • Shouldn't features depend on the context of other objects?
  • We propose a novel feature interaction mechanism
  • How can we utilize the information from all the modalities?
  • Extract multi-modal features that are complementary to each other
  • i.e., 2D motion + 2D appearance + 3D motion + 3D appearance

Pipeline from our work: objects in frame t and frame t+1 → 2D + 3D feature extractor → feature interaction (applied iteratively) → affinity matrix → Hungarian algorithm
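The feature-interaction idea can be sketched as message passing: each object's feature is refined by aggregating the features of the other objects, so the representation depends on scene context rather than on each object in isolation. This is my own drastic simplification of the GNN3DMOT mechanism (no learned weights, a plain mean aggregator):

```python
import numpy as np

def feature_interaction(feats, num_layers=3):
    """GNN-style refinement: mix each object's feature with the mean feature
    of all other objects, repeated for num_layers rounds."""
    f = feats.copy()
    n = len(f)
    for _ in range(num_layers):
        total = f.sum(axis=0, keepdims=True)
        others_mean = (total - f) / (n - 1)  # context message per object
        f = 0.5 * f + 0.5 * others_mean      # residual-style update
    return f

feats = np.random.rand(3, 4)        # three objects, 4-dim features
refined = feature_interaction(feats)
print(refined.shape)  # → (3, 4)
```

In the actual model, each round uses learned node and edge functions, and the refined features feed the affinity matrix for matching.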

SLIDE 28

Improve Feature Learning for 3D MOT

  • X. Weng, Y. Wang, Y. Man, K. Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR 2020
  • Is encoding the multi-modal features really useful?
  • Answer: yes
  • We should encode different features so that they complement each other

[Table: A = appearance feature, M = motion feature. Using features from multiple modalities increases performance over a single modality.]

SLIDE 29

Improve Feature Learning for 3D MOT

  • X. Weng, Y. Wang, Y. Man, K. Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR 2020
  • Is feature interaction using a GNN useful for 3D MOT?
  • Answer: yes
  • We should let objects communicate and encode the context information

Performance increases greatly with 3 GNN layers vs. 0!

SLIDE 30

Improve Feature Learning for 3D MOT

  • X. Weng, Y. Wang, Y. Man, K. Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR 2020
  • For more details on our CVPR work, our poster session is as follows:
  • Date: Wednesday, June 17
  • Q&A Time: 12:00–14:00 Pacific Time
  • Session: Poster 2.2 — Face, Gesture, and Body Pose; Motion and Tracking; Representation Learning
  • Link: http://cvpr20.com/event/gnn3dmot-graph-neural-network-for-3d-multi-object-tracking-with-2d-3d-multi-feature-learning/

SLIDE 31

Moving Forward

End-to-End Perception and Prediction Pipeline

SLIDE 32

End-to-End Perception and Prediction Pipeline

  • So far, only data association is jointly optimized
  • What is next? Can we go further?
  • End-to-end MOT and detection?
  • End-to-end MOT and trajectory forecasting?
  • End-to-end MOT with both detection and forecasting?

Pipeline: sensor data → 3D object detector → 3D detections → feature extractor → pairwise affinity matrix → optimizer → 3D object trajectories (feature extractor and optimizer jointly optimized); past object trajectories → trajectory forecasting

SLIDE 33

Joint 3D MOT and Trajectory Forecasting

  • Prior work separates 3D MOT and trajectory forecasting
  • Why is it not good to separate the two?
  • Optimization of the entire pipeline is impossible, leading to sub-optimal performance
  • Inference is slow due to the separate modular design; each network takes time
  • What can we do?
  • X. Weng, Y. Ye, K. Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv 2020

Pipeline from prior work (separate stages): detected objects in the current frame → feature extraction → 3D MOT head → object trajectories up to the current frame; object trajectories in the past H frames → feature extraction → trajectory forecasting head → predicted trajectories in the future T frames

SLIDE 34

Joint 3D MOT and Trajectory Forecasting

  • Parallelize MOT and forecasting
  • Share the feature learning process
  • Use GNN3DMOT as part of our network for tracking
  • Add a multi-modal trajectory forecasting head
  • X. Weng, Y. Ye, K. Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv 2020

Joint 3D tracking and forecasting pipeline: detected objects in the current frame and object trajectories in the past H frames → shared feature extraction → GNN for feature interaction (node and edge features) → 3D MOT head, and → diversity sampling → trajectory forecasting head → predicted trajectories in the future T frames
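The shared-feature, two-head layout can be sketched abstractly. All shapes and weights below are hypothetical placeholders of mine (the actual model uses GNN layers and learned heads), but they show how one feature tensor serves both tasks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N objects, D-dim shared features, T future timesteps.
N, D, T = 5, 16, 10

shared = rng.standard_normal((N, D))           # output of shared feature learning

# Two task heads on top of the same features (random weights for illustration).
W_track = rng.standard_normal((D, D))          # 3D MOT head: per-object embedding
W_fore  = rng.standard_normal((D, T * 2))      # forecasting head: T future (x, y)

track_emb = shared @ W_track                   # used to build the affinity matrix
future_xy = (shared @ W_fore).reshape(N, T, 2) # predicted future trajectories
print(track_emb.shape, future_xy.shape)  # → (5, 16) (5, 10, 2)
```

Because both heads backpropagate into `shared`, training either task shapes the features the other task consumes, which is the mechanism behind the mutual gains reported on the next two slides.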

SLIDE 35

Joint 3D MOT and Trajectory Forecasting

  • Is joint optimization useful?
  • Adding forecasting is useful to tracking
  • How does adding forecasting affect 3D MOT?
  • Joint optimization with forecasting improves tracking performance
  • X. Weng, Y. Ye, K. Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv 2020

Compared with 3D MOT evaluation without the forecasting module: improvement on 5 out of 6 entries!

SLIDE 36

Joint 3D MOT and Trajectory Forecasting

  • Is joint optimization useful?
  • Adding forecasting is useful to tracking
  • Adding MOT is useful to forecasting
  • How does adding 3D MOT affect trajectory forecasting?
  • Joint optimization with 3D MOT improves forecasting performance
  • X. Weng, Y. Ye, K. Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv 2020

Compared with forecasting evaluation without 3D MOT: performance improved after adding MOT!

SLIDE 37

Joint 3D MOT and Trajectory Forecasting

  • Is joint optimization useful?
  • Yes. Joint optimization is useful to both modules!
  • For more details on this arXiv work, scan the QR code for the paper
  • X. Weng, Y. Ye, K. Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv 2020

SLIDE 38

Joint MOT and Object Detection

  • Now we have a method for joint MOT and forecasting
  • Can we do joint detection and MOT?
  • Use GNN3DMOT as part of our network for tracking
  • Add a detection head to classify/regress objects
  • Can possibly be extended to BEV and 3D detection and MOT
  • Will be released soon
  • Y. Wang, X. Weng, K. Kitani. Joint Detection and Multi-Object Tracking with Graph Neural Networks and Complete Feature Learning. arXiv 2020

Joint detection and tracking pipeline: anchors in the current frame and object trajectories in the past H frames → feature extraction → GNN for feature interaction (node and edge features) → object detection head and 3D MOT head → trajectories up to the current frame

SLIDE 39

Joint MOT, Detection and Forecasting

  • The most complete joint pipeline for detection, tracking and forecasting

Liang et al. PnPNet: End-to-End Perception and Prediction with Tracking in the Loop. CVPR 2020

SLIDE 40

Moving Forward

Achieve trajectory forecasting as tracking

SLIDE 41

Conventional Perception and Prediction Pipeline

  • Traditional pipeline:
  • Detection -> data association -> trajectory forecasting
  • Is this pipeline the best?
  • What are other options?

Weng et al. Unsupervised Sequence Forecasting of 100,000 Points for Unsupervised Trajectory Forecasting. arXiv 2020

SLIDE 42

Trajectory Forecasting as Tracking

  • Traditional pipeline:
  • Detection -> MOT -> trajectory forecasting
  • Our new pipeline switches the order:
  • Sensor data forecasting -> detection -> MOT

Weng et al. Unsupervised Sequence Forecasting of 100,000 Points for Unsupervised Trajectory Forecasting. arXiv 2020

SLIDE 43

Take Home Message

  • It is important to develop appropriate evaluation metrics for 3D MOT to measure progress
  • The representation of objects in 3D MOT should take other objects into account

Open Problems:

  • Many large-scale datasets, but sensor suites and annotations are not unified
  • 3D detection performance is improving but doesn't take sensor physics into account
  • The context of the multi-level optimization problem (sensors, forecasting, control) is not taken into account
  • The representation doesn't take the context of other objects, the scene, and the past into account
  • Need 3D MOT evaluation datasets
  • Sensor optimization and redundancy should also be taken into account
  • Detection and tracking should be coupled more tightly
  • Dynamics models should be customized to object type

SLIDE 44

3D Multi-Object Tracking for Autonomous Driving

Xinshuo Weng, Kris Kitani

June 15, 2020