Early Experience in Benchmarking Edge AI Processors with Object - - PowerPoint PPT Presentation

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads Bench 2019 Yujie Hui 1 , Jeffrey Lien 2 , and Xiaoyi Lu 1 1 Department of Computer Science and Engineering, The Ohio State University {hui.82, lu.932}@osu.edu 2


SLIDE 1

The Ohio State University

Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads

Yujie Hui¹, Jeffrey Lien², and Xiaoyi Lu¹

¹ Department of Computer Science and Engineering, The Ohio State University
{hui.82, lu.932}@osu.edu

² NovuMind Inc.
jlien@novumind.com

Bench 2019

SLIDE 2

Overview

  • Introduction
  • Overview of Edge AI Processors
  • Benchmarking Methodology
  • Evaluation
  • Conclusion

SLIDE 3

Edge Computing

  • Store and process data closer to the location where it is needed
  • Deliver low latency to end users

[Figure: edge computing diagram with the labels APP, DATA, and Network]

SLIDE 4

Artificial Intelligence at the Edge

[Figure: the pipeline Data, Features, Training, Evaluation, Inference shown twice: once entirely in the datacenter (e.g., GPU), and once split between the datacenter (e.g., GPU) and edge devices]

  • Inference is moving to the edge

❖ Heavy workloads in datacenters
❖ Less computationally demanding
❖ Low power consumption
❖ Low cost

SLIDE 5

Killer Applications for AI@Edge – Object Detection

Machine Learning Use Cases in Facebook

  • Recommendation: 2%
  • RNN ASR: 10%
  • RNN Translator: 6%
  • Image Classification: 42%
  • Object Detection: 34%
  • Object Segmentation: 3%
  • Face ID: 3%

Wu et al., Machine Learning at Facebook: Understanding Inference at the Edge, HPCA-2019
C. Wu, At-Scale Infrastructure Challenges for Machine Learning, IISWC-2019 (Invited Talk)

  • Object Detection:

❖ Higher resolution of input images
❖ Larger output tensors
❖ More complicated tasks

SLIDE 6

Object Detection Workloads - Demo

Real-life applications:

❖ Self-driving cars
❖ Tracking objects
❖ Face detection
❖ Pedestrian detection
❖ Medical imaging
❖ Robotics

Low-latency, high-accuracy inference needs high-performance edge devices!

SLIDE 7

Overview

  • Introduction
  • Overview of Edge AI Processors
      • Edge TPU
      • NVIDIA Xavier
      • NovuTensor
  • Benchmarking Methodology
  • Evaluation
  • Conclusion

SLIDE 8

Edge AI Processors - EdgeTPU

https://coral.withgoogle.com/products/dev-board

  • A single-board computer
  • On-board Edge TPU coprocessor capable of performing 4 TOPS
  • 1 GB LPDDR4 memory
  • Precision: INT8
  • Power: 2.5 watts
  • Supports TensorFlow Lite models

SLIDE 9

Edge AI Processors - Xavier

https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit

  • Volta GPU with 512 CUDA cores
  • TOPS: 22.6/11.3/1.3 (for INT8/FP16/FP32, respectively)
  • 16 GB LPDDR4X memory
  • Precision: INT8/FP16/FP32
  • Power: 10/15/30 watts (selectable power modes)
  • Supports CUDA, cuDNN, TensorRT

SLIDE 10

Edge AI Processors – NovuTensor

  • Domain-specific architecture focused on 3D tensor computation
  • 2 GB DDR4 memory
  • Precision: INT8
  • TOPS: 15
  • Power: 20 watts
  • Supports PyTorch

[Figure: NovuTensor's 3D operation: a tensor convolution of a weight tensor with a data tensor produces an output tensor]

[1] https://patentscope.wipo.int/search/en/detail.jsf?docId=US225521272&tab=NATIONALBIBLIO
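
The peak figures quoted on the three processor slides can be put on a common footing by dividing peak TOPS by rated power. The sketch below does only that arithmetic. Pairing Xavier's 22.6 TOPS (INT8) with its 30-watt mode is an assumption, since the slides do not say which power mode each peak figure belongs to, and peak TOPS per watt is not the measured energy efficiency reported later in the talk.

```python
# Peak throughput (TOPS) and rated power (watts) as quoted on the slides.
# Assumption: Xavier's 22.6 TOPS INT8 peak is paired with its 30 W mode.
SPECS = {
    "Edge TPU": {"tops": 4.0, "watts": 2.5},
    "Xavier (INT8, 30 W)": {"tops": 22.6, "watts": 30.0},
    "NovuTensor": {"tops": 15.0, "watts": 20.0},
}


def tops_per_watt(tops: float, watts: float) -> float:
    """Return peak tera-operations per second per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tops / watts


for name, spec in SPECS.items():
    print(f"{name}: {tops_per_watt(**spec):.2f} peak TOPS/W")
```

These are datasheet-style ratios only; real workloads rarely reach peak throughput, which is why the talk measures latency and images/sec/watt directly.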

SLIDE 11

Challenges of Benchmarking Edge AI Processors

  • Challenge-1: Workload Selection

❖ What are the representative models and datasets for benchmarking edge AI processors with object detection workloads?

  • Challenge-2: Deployment

❖ How can deep neural networks be deployed on edge devices, given that each edge device needs a specific framework?

  • Challenge-3: Metrics and Dimensions

❖ How can an essential set of metrics and dimensions be selected to comprehensively evaluate edge AI devices?

SLIDE 12

Overview

  • Introduction
  • Overview of Edge AI Processors
  • Benchmarking Methodology
      • Workload and Dataset Selection
      • Deployment Experience
      • Metrics and Dimensions Selection
  • Evaluation
  • Conclusion

SLIDE 13

Object Detection Workloads – YOLOv2

https://pjreddie.com/darknet/yolov2/

  • A real-time object detection system that reports which objects appear in the input
  • Tiny-YOLO is a lightweight version of YOLOv2
  • Built on the Darknet framework; can detect objects in an image or a video
  • Uses the Darknet-19 neural network

[Figure: Darknet-19]

YOLO9000: Better, Faster, Stronger. Joseph Redmon, Ali Farhadi
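
To make the earlier point about "larger output tensors" concrete, the sketch below computes the shape of a YOLOv2-style output tensor for the two input sizes used in the evaluation. The defaults (a total stride of 32, 5 anchor boxes per grid cell, and 80 classes as in MS COCO) are assumptions taken from the standard YOLOv2 configuration; the slides do not confirm that the deployed models use exactly these settings, and the function name is hypothetical.

```python
def yolo_v2_output_shape(input_size, num_classes=80, num_anchors=5, stride=32):
    """Return (grid_h, grid_w, channels) for a square input image.

    Each anchor predicts 4 box coordinates, 1 objectness score, and one
    score per class, so channels = num_anchors * (5 + num_classes).
    """
    if input_size % stride != 0:
        raise ValueError("input_size must be a multiple of the stride")
    grid = input_size // stride
    channels = num_anchors * (5 + num_classes)
    return (grid, grid, channels)


for size in (416, 1024):
    h, w, c = yolo_v2_output_shape(size)
    print(f"{size}x{size} input: {h}x{w}x{c} output, {h * w * c} values")
```

Under these assumptions, moving from 416x416 to 1024x1024 inputs grows the grid from 13x13 to 32x32 cells, so the post-processing stage has roughly six times as many values to decode.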

SLIDE 14

Object Detection Workloads – MS COCO

http://cocodataset.org/#home

[Figure: Microsoft COCO dataset examples]

  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories

❖ Images contain rich information, with many objects per image
❖ Large number of instances per category

Microsoft COCO: Common Objects in Context. Lin et al.

SLIDE 15

Deployment Experience

EdgeTPU

  • Modify the weights of the first convolutional layer
  • DarkFlow [1]
  • Post-Training Integer Quantization [2]
  • EdgeTPU compiler

Xavier

  • NVIDIA's DeepStream reference applications [3]
  • TensorRT 5.0.3
  • 15-watt and 30-watt modes

NovuTensor

  • NovuSDK
  • Retrain the model using the ReLU activation function

[1] https://github.com/thtrieu/darkflow
[2] https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba
[3] https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps
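
The Edge TPU path relies on post-training integer quantization, which maps 32-bit floating-point values to 8-bit integers. The sketch below illustrates the underlying affine mapping (a scale and a zero point) in plain Python. It is a simplified illustration, not the TensorFlow Lite implementation, which among other things calibrates activation ranges on representative inputs; all function names here are hypothetical.

```python
def quantize_params(x_min, x_max, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [x_min, x_max] onto INT8."""
    # Widen the range to include 0.0 so that zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to its nearest INT8 code, clamped to the valid range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to the float it represents."""
    return scale * (q - zero_point)


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]  # made-up example values
scale, zero_point = quantize_params(min(weights), max(weights))
restored = [dequantize(quantize(w, scale, zero_point), scale, zero_point)
            for w in weights]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f}")
```

The round-trip error stays within about one quantization step, which is the kind of small per-value loss that accumulates into the accuracy differences measured in the evaluation.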

SLIDE 16

Metrics and Dimensions

  • Accuracy: Mean Average Precision (mAP)

$$AP = \int_0^1 p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $p(r)$ is the precision at recall $r$ and $N$ is the number of object categories.

  • Latency (ms): execution time
      • Preprocess
      • Execution
      • Postprocess

  • Energy Efficiency (images/sec/watt): number of input images that can be fully processed per unit of power

SLIDE 17

Overview

  • Introduction
  • Overview of Edge AI Processors
  • Benchmarking Methodology
  • Evaluation
  • Conclusion

SLIDE 18

Accuracy Dimension

Accuracy: Mean Average Precision (mAP)

$$AP = \int_0^1 p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
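
As an illustration of the accuracy metric, the sketch below computes average precision as the area under a precision-recall curve for one class, then averages over classes to obtain mAP. It uses the common "all-point" interpolation and assumes detections are already matched to ground truth. The slides do not state which AP variant or IoU threshold the authors used, so this is one reasonable reading rather than their evaluation code.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for a single class.

    is_true_positive: detections sorted by descending confidence; True marks
    a detection matched to a ground-truth object, False a false positive.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    true_positives = 0
    recalls, precisions = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += bool(hit)
        recalls.append(true_positives / num_ground_truth)
        precisions.append(true_positives / rank)
    # Make precision non-increasing in recall (the PR-curve envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall advances.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP = (1 / N) * sum of AP_i over the N classes."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class) / len(ap_per_class)


# Made-up detections for two classes, purely for illustration.
ap_a = average_precision([True, True], num_ground_truth=2)
ap_b = average_precision([True, False, True], num_ground_truth=2)
print(f"mAP = {mean_average_precision([ap_a, ap_b]):.4f}")
```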

SLIDE 19

Evaluation Results - Accuracy

  • Edge AI processors provide accurate results, with a 1% to 3% accuracy difference due to lower-precision arithmetic
  • Accuracy degradation differs across devices because each implements quantization differently

[Figure: mAP of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Caption: Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]

SLIDE 20

Latency Dimension

Latency (ms)

Execution time:

  • Preprocess
  • Execution
  • Postprocess
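
A minimal sketch of how the three stages could be timed separately and summed into a per-image latency is shown below. The stage functions are hypothetical placeholders supplied by the caller, not any device SDK's API, and the slides do not describe the authors' actual timing harness.

```python
import time


def timed(stage, *args):
    """Run one stage and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = stage(*args)
    return result, (time.perf_counter() - start) * 1000.0


def measure_latency(image, preprocess, execute, postprocess):
    """Time preprocess, execution, and postprocess for a single image."""
    timings = {}
    data, timings["preprocess"] = timed(preprocess, image)
    raw, timings["execution"] = timed(execute, data)
    detections, timings["postprocess"] = timed(postprocess, raw)
    timings["total"] = (timings["preprocess"] + timings["execution"]
                        + timings["postprocess"])
    return detections, timings


# Placeholder stages standing in for real image handling and inference.
def fake_preprocess(image):
    return [value / 255.0 for value in image]


def fake_execute(tensor):
    return sum(tensor)


def fake_postprocess(raw):
    return [raw]


detections, timings = measure_latency([0.0] * 16, fake_preprocess,
                                      fake_execute, fake_postprocess)
for stage_name, elapsed_ms in timings.items():
    print(f"{stage_name}: {elapsed_ms:.4f} ms")
```

In practice one would discard warm-up runs and average over many images; the sketch times a single pass only.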

SLIDE 21

Evaluation Results - Latency

[Figure: latency (ms) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Caption: Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]

❖ EdgeTPU is 9.5X and 14.79X slower than the GPU when running Tiny-YOLO and YOLOv2, respectively
❖ NovuTensor and Xavier are 4.66X to 6.08X slower than the GPU
❖ In the max power mode, Xavier is 2X and 5.28X faster than EdgeTPU
❖ NovuTensor is 2.04X and 3.8X faster than EdgeTPU for YOLOv2 and Tiny-YOLO, respectively

SLIDE 22

Energy Efficiency Dimension

Energy Efficiency (images/sec/watt): number of input images that can be fully processed per unit of power
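
The images/sec/watt metric can be sketched as throughput divided by power. The version below derives throughput from per-image latency, which assumes images are processed one at a time with no batching or pipelining; the slides do not say how throughput and power were measured, and the numbers in the example are made up.

```python
def energy_efficiency(latency_ms, power_watts, batch_size=1):
    """Return images per second per watt.

    Assumes throughput is batch_size images every latency_ms milliseconds.
    """
    if latency_ms <= 0 or power_watts <= 0:
        raise ValueError("latency_ms and power_watts must be positive")
    images_per_second = batch_size * 1000.0 / latency_ms
    return images_per_second / power_watts


# Hypothetical device: 40 ms per image while drawing 2.5 W.
print(f"{energy_efficiency(40.0, 2.5):.2f} images/sec/watt")
```

Because power sits in the denominator, a slow but frugal device can still score well on this metric, which is consistent with the trade-off the results slides describe.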

SLIDE 23

Evaluation Results – Energy Efficiency

❖ All edge AI processors have higher energy efficiency than the GPU because of their low power consumption
❖ EdgeTPU delivers 2.9X and 1.13X higher energy efficiency than Xavier, and 1.96X and 1.04X higher than NovuTensor

[Figure: energy efficiency (image/sec/watt) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Caption: Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]

SLIDE 24

Evaluation Results – Large Images

  • Xavier in the 15-watt mode delivers the best energy efficiency
  • 1080Ti using TensorRT delivers the best latency

[Figure: (a) Latency (ms) and (b) Energy Efficiency (image/sec/watt) on Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Caption: Performance running YOLOv2 and Tiny-YOLO with 1024X1024 input images]

SLIDE 25

Summary

[Figure: Accuracy, Performance, and Energy Efficiency factors for Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti; panels (a) YOLOv2 and (b) Tiny-YOLO. Caption: Comparison of factors running YOLOv2 and Tiny-YOLO with 416x416 input images]

  • Observations:

❖ EdgeTPU provides the best energy efficiency
❖ Xavier and NovuTensor provide comparable latency and energy efficiency
❖ TensorRT optimizes and accelerates inference on GPU platforms

SLIDE 26

Overview

  • Introduction
  • Overview of Edge AI Processors
  • Benchmarking Methodology
  • Evaluation
  • Conclusion

SLIDE 27

Conclusion

  • Propose a benchmarking methodology to evaluate three different kinds of edge AI processors (i.e., Edge TPU, NVIDIA Xavier, and NovuTensor) along three dimensions
  • Deploy YOLO on EdgeTPU for the first time
  • Observe that NovuTensor and Xavier provide comparable latency and energy efficiency, while Edge TPU consumes less energy but is much slower for inference

Future Work

  • Open-source our benchmark and evaluate more combinations of neural networks and edge AI platforms
  • Propose an easy-to-use benchmarking toolkit for edge AI processors

SLIDE 28

Thank You!

Questions?