SLIDE 1

The Ohio State University

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads

Yujie Hui¹, Jeffrey Lien², and Xiaoyi Lu¹

¹ Department of Computer Science and Engineering, The Ohio State University
{hui.82, lu.932}@osu.edu

² NovuMind Inc.
jlien@novumind.com

Bench 2019

SLIDE 2

Overview

Introduction
Overview of Edge AI Processors
Benchmarking Methodology
Evaluation
Conclusion

SLIDE 3

Edge Computing

Store and process the data closer to the location where it is needed
Deliver low latency to the end users

[Figure: applications (APP) and data (DATA) connected through the network to edge computing nodes]

SLIDE 4

Artificial Intelligence at the Edge

[Figure: the Data, Features, Training, Evaluation, Inference pipeline, shown once entirely in the datacenter (e.g., GPU) and once split between the datacenter and edge devices]

Inference is moving to the edge

❖ Heavy workloads in datacenters
❖ Less computationally demanding
❖ Low power consumption
❖ Low cost

SLIDE 5

Killer Applications for AI@Edge – Object Detection

Machine Learning Use Cases in Facebook

Recommendation: 2%
RNN ASR: 10%
RNN Translator: 6%
Image Classification: 42%
Object Detection: 34%
Object Segmentation: 3%
Face ID: 3%

Wu et al., Machine Learning at Facebook: Understanding Inference at the Edge, HPCA-2019
C. Wu, At-Scale Infrastructure Challenges for Machine Learning, IISWC-2019 (Invited Talk)

Object Detection:

❖ Higher resolution of input images
❖ Larger output tensors
❖ More complicated tasks

SLIDE 6

Object Detection Workloads - Demo

Real-life applications:

❖ Self-driving cars
❖ Tracking objects
❖ Face detection
❖ Pedestrian detection
❖ Medical imaging
❖ Robotics

Low-latency, high-accuracy inference needs high-performance edge devices!

SLIDE 7

Overview

Introduction
Overview of Edge AI Processors
  Edge TPU
  NVIDIA Xavier
  NovuTensor
Benchmarking Methodology
Evaluation
Conclusion

SLIDE 8

Edge AI Processors - EdgeTPU

https://coral.withgoogle.com/products/dev-board

A single-board computer
On-board Edge TPU coprocessor capable of performing 4 TOPS
1 GB LPDDR4 memory
Precision: INT8
Power: 2.5 watts
Supports TensorFlow Lite models

SLIDE 9

Edge AI Processors - Xavier

https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit

Volta GPU with 512 CUDA cores
TOPS: 22.6/11.3/1.3
16GB LPDDR4X memory
Precision: INT8/FP16/FP32
Power: 10/15/30 watts
Supports CUDA, cuDNN, TensorRT

SLIDE 10

Edge AI Processors – NovuTensor

Domain-specific architecture focused on 3D tensor computation
2 GB DDR4 memory
Precision: INT8
TOPS: 15
Power: 20 watts
Supports PyTorch

[1] https://patentscope.wipo.int/search/en/detail.jsf?docId=US225521272&tab=NATIONALBIBLIO

[Figure: NovuTensor's 3D operation, a tensor convolution of a weight tensor and a data tensor producing an output tensor]
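
One rough way to compare the three spec sheets above is peak throughput per watt. The sketch below is only illustrative: it divides the peak TOPS figures by the power figures quoted on the slides. Pairing Xavier's INT8 figure with its 30-watt mode is an assumption, since the slides do not state which power mode each TOPS figure corresponds to. Peak TOPS per watt is not a measured result and says nothing about real workloads.

```python
# Peak figures copied from the spec slides; the Xavier pairing
# (INT8 TOPS with the 30-watt mode) is an assumption for illustration.
SPECS = {
    "Edge TPU": (4.0, 2.5),                    # (peak INT8 TOPS, watts)
    "Xavier (INT8, 30 W, assumed)": (22.6, 30.0),
    "NovuTensor": (15.0, 20.0),
}


def tops_per_watt(tops, watts):
    """Return peak tera-operations per second per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tops / watts


for name, (tops, watts) in SPECS.items():
    print(f"{name}: {tops_per_watt(tops, watts):.2f} peak TOPS/W")
```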

SLIDE 11

Challenges of Benchmarking Edge AI Processors

Challenge-1: Workload Selection

❖ What are the representative models and datasets for benchmarking edge AI processors with object detection workloads?

Challenge-2: Deployment

❖ How to deploy deep neural networks on edge devices, given that each edge device needs a specific framework?

Challenge-3: Metrics and Dimensions

❖ How to select an essential set of metrics and dimensions to comprehensively evaluate edge AI devices?

SLIDE 12

Overview

Introduction
Overview of Edge AI Processors
Benchmarking Methodology
  Workload and Dataset Selection
  Deployment Experience
  Metrics and Dimensions Selection
Evaluation
Conclusion

SLIDE 13

Object Detection Workloads – YOLOv2

https://pjreddie.com/darknet/yolov2/

A real-time object detection system, which tells us what objects are seen
Tiny-YOLO is a lite version of YOLOv2
Based on the Darknet framework; can detect objects in an image or a video
Darknet-19 neural network

[Figure: Darknet-19]

Joseph Redmon, Ali Farhadi. YOLO9000: Better, Faster, Stronger.

SLIDE 14

Object Detection Workloads – MS COCO

Microsoft COCO Dataset Examples

http://cocodataset.org/#home

330K images (>200K labeled)
1.5 million object instances
80 object categories

❖ Images contain rich information with many objects per image
❖ Large number of instances per category

Microsoft COCO: Common Objects in Context. Lin et al.

SLIDE 15

Deployment Experience

EdgeTPU

Modify the weights of the first convolutional layer
DarkFlow [1]
Post-Training Integer Quantization [2]
EdgeTPU compiler

[Figure: EdgeTPU deployment flow, from a 32-bit TensorFlow model through post-training integer quantization to an 8-bit TensorFlow Lite model, then through the Edge TPU compiler]

Xavier

NVIDIA's DeepStream reference applications [3]
TensorRT 5.0.3
15-watt and 30-watt modes

NovuTensor

NovuSDK
Retrain the model using the ReLU activation function

[1] https://github.com/thtrieu/darkflow
[2] https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba
[3] https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps
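
To make the post-training integer quantization step in the EdgeTPU flow more concrete, here is a minimal sketch of a generic affine (scale and zero-point) mapping from floating-point values to 8-bit integers. It is written in plain Python for illustration only and is not the TensorFlow Lite implementation; the function names and the sample weights are hypothetical.

```python
def quantization_params(values, num_bits=8):
    """Return (scale, zero_point) mapping the observed float range to unsigned ints."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clipping out-of-range values."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [scale * (q - zero_point) for q in qvalues]


# Hypothetical weights, standing in for one layer of a trained model.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zp = quantization_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
print(q)
print(restored)
```

The round trip through `quantize` and `dequantize` loses a small amount of precision per value, which is the kind of error that can show up as an accuracy difference after quantization.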

SLIDE 16

Metrics and Dimensions

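Latency (ms)

Execution time:
Preprocess
Execution
Postprocess

Energy Efficiency (Images/sec/watt)

Number of input images that can be fully processed per unit of power

Accuracy

Mean Average Precision:

$$AP = \int_{0}^{1} p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where p(r) is the precision at recall r and N is the number of object categories.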
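
As a rough illustration of the latency and energy efficiency dimensions, the sketch below times the three stages (preprocess, execution, postprocess) for one image and converts a latency and a power figure into images/sec/watt. It assumes one image is processed at a time; the function names, the placeholder stages, and the example numbers are hypothetical and are not taken from the authors' benchmark.

```python
import time


def timed_stages(preprocess, execute, postprocess, image):
    """Run the three stages on one image; return the result and per-stage times in ms."""
    timings = {}
    data = image
    stages = (("preprocess", preprocess),
              ("execution", execute),
              ("postprocess", postprocess))
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = (time.perf_counter() - start) * 1000.0
    # End-to-end latency is the sum of the three stage times.
    timings["total"] = sum(timings.values())
    return data, timings


def energy_efficiency(latency_ms, power_watts):
    """Images per second per watt, assuming images are processed one at a time."""
    if latency_ms <= 0 or power_watts <= 0:
        raise ValueError("latency and power must be positive")
    images_per_sec = 1000.0 / latency_ms
    return images_per_sec / power_watts


# Placeholder stages standing in for real image handling and inference.
result, t = timed_stages(lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, 1)
print(t)

# Made-up numbers: 50 ms per image at 10 W.
print(energy_efficiency(50.0, 10.0))
```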

SLIDE 17

Overview

Introduction
Overview of Edge AI Processors
Benchmarking Methodology
Evaluation
Conclusion

SLIDE 18

Accuracy Dimension

SLIDE 19

Evaluation Results - Accuracy

The edge AI processors provide accurate results, with a 1% to 3% accuracy difference due to lower-precision arithmetic
Accuracy degradation differs across devices because each uses a different quantization implementation

Performance running YOLOv2 and Tiny-YOLO with 416x416 input images

[Figure: mAP of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti]
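
The mAP metric used for this comparison averages, over object categories, the area under each category's precision-recall curve. The sketch below shows one common way to approximate that area, using all-point interpolation. The slides do not state which interpolation the authors used, so this choice is an assumption, and the precision-recall points in the example are made up.

```python
def average_precision(recalls, precisions):
    """Approximate AP (area under the precision-recall curve) by all-point interpolation.

    `recalls` must be sorted in ascending order, with matching `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-category AP values."""
    if not ap_per_class:
        raise ValueError("need at least one category")
    return sum(ap_per_class) / len(ap_per_class)


# Made-up precision-recall points for a single category.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap)
print(mean_average_precision([ap, 0.25]))
```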

SLIDE 20

Latency Dimension

SLIDE 21

Evaluation Results - Latency

Performance running YOLOv2 and Tiny-YOLO with 416x416 input images

[Figure: Latency (ms) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti]

❖ EdgeTPU is 9.5X and 14.79X slower than the GPU when running Tiny-YOLO and YOLOv2
❖ NovuTensor and Xavier are 4.66X - 6.08X slower than the GPU
❖ Xavier is 2X and 5.28X faster than EdgeTPU in the max power mode
❖ NovuTensor is 2.04X and 3.8X faster than EdgeTPU for YOLOv2 and Tiny-YOLO

SLIDE 22

Energy Efficiency Dimension

SLIDE 23

Evaluation Results – Energy Efficiency

❖ All edge AI processors have higher energy efficiency than the GPU due to their low power consumption
❖ EdgeTPU delivers 2.9X and 1.13X higher energy efficiency than Xavier; 1.96X and 1.04X higher than NovuTensor

[Figure: Energy Efficiency (image/sec/watt) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti]

Performance running YOLOv2 and Tiny-YOLO with 416x416 input images

SLIDE 24

Evaluation Results – Large Images

Performance running YOLOv2 and Tiny-YOLO with 1024X1024 input images

Xavier in the 15-watt mode delivers the best energy efficiency
1080Ti with TensorRT delivers the lowest latency

[Figure: (a) Latency (ms) and (b) Energy Efficiency (image/sec/watt) on Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti]

SLIDE 25

Summary

[Figure: Accuracy, Performance, and Energy Efficiency of Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti for (a) YOLOv2 and (b) Tiny-YOLO]

Comparison of factors running YOLOv2 and Tiny-YOLO with 416x416 input images

Observations:

① EdgeTPU provides the best energy efficiency
② Xavier and NovuTensor provide comparable latency and energy efficiency
③ TensorRT optimizes and accelerates GPU platforms

SLIDE 26

Overview

Introduction
Overview of Edge AI Processors
Benchmarking Methodology
Evaluation
Conclusion

SLIDE 27

Conclusion

Propose a benchmarking methodology to evaluate three different kinds of edge AI processors (i.e., Edge TPU, NVIDIA Xavier, and NovuTensor) from three dimensions
Deploy YOLO on EdgeTPU for the first time
Observe that NovuTensor and Xavier provide comparable latency and energy efficiency, while Edge TPU consumes less energy but is much slower for inference

Future Work

Open source our benchmark and evaluate more combinations of different neural networks and edge AI platforms
Propose an easy-to-use benchmarking toolkit for edge AI processors

SLIDE 28

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads

Yujie Hui¹, Jeffrey Lien², and Xiaoyi Lu¹

Bench 2019

Thank You!

Questions?