Professor Media IC & System Lab Graduate Institute of - - PowerPoint PPT Presentation



SLIDE 1

Shao-Yi Chien (簡韶逸) Professor Media IC & System Lab Graduate Institute of Electronics Engineering National Taiwan University

SLIDE 2

Outline

— AI edge: distributed intelligence
— Tensor transform for memory-efficient operations
— Implementation results
— Conclusion

SLIDE 3

Internet-of-AI-Things

AI IoT Big Data

SLIDE 4

Where Should Computing be Located?

— Data from the Internet: big data
— Data from the IoT: ultra-big data!
— AI on the cloud?
— AI on the edge?

(Diagram: Cloud Servers / Aggregators / Smart Devices)

SLIDE 5

Sensor → Aggregator/Gateway → Cloud

Data from each sensor: large (sensor side) → small (cloud side)
Semantic level: low (sensor side) → high (cloud side)

Sensor: light-weight learning/recognition engine
Aggregator/gateway: data filtering process; HSA, NPU, DSP, neural processors
Cloud: context inferring process; cloud servers with CPU/GPU/FPGA

Distributed Intelligence

AI Edge

SLIDE 6

Deep Learning Ecosystem

  • Memory efficiency is the most important target for optimization

SLIDE 7

Unroll: Fast and Simple
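The body of this slide was image-only. As a sketch of the idea (assuming the standard im2col-style unroll, stride 1, no padding; the `im2col` name and sizes are illustrative), convolution becomes a single matrix-vector product over the unrolled patches:

```python
import numpy as np

def im2col(x, k):
    """Unroll a 2-D input into a matrix with one k*k patch per row
    (stride 1, no padding). Convolution then reduces to a single
    matrix-vector product over this matrix."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(16, dtype=np.float64).reshape(4, 4)
w = np.ones((3, 3)) / 9.0        # 3x3 box filter
y = im2col(x, 3) @ w.ravel()     # the whole convolution is one GEMV
```

Each output pixel is one inner product between a row of the unrolled matrix and the flattened kernel, which is exactly the dense, regular access pattern GPUs and GEMM kernels favor.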

SLIDE 8

Formulation of Unrolling
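The formulation itself was on the slide image; a hedged reconstruction of the standard unroll formulation (single channel, stride 1, no padding; all symbols here are assumptions, with W_o = W - k + 1 output columns) is:

```latex
% One row per output position (i, j), one column per filter tap (a, b):
U_{(i W_o + j),\,(a k + b)} = x_{i+a,\,j+b}, \qquad 0 \le a, b < k
% Convolution then reduces to a single matrix-vector product:
y = U \,\mathrm{vec}(w)
```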

SLIDE 9

Unroll: More than Conv.
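A sketch of the point (assuming the same im2col-style patch matrix as on the previous slides): the unroll itself is operator-agnostic, and only the per-row reduction changes between operators.

```python
import numpy as np

x = np.arange(16, dtype=np.float64).reshape(4, 4)
# im2col-style unroll: one 3x3 patch per row (stride 1, no padding).
patches = np.array([x[i:i + 3, j:j + 3].ravel()
                    for i in range(2) for j in range(2)])

conv = patches @ (np.ones(9) / 9.0)  # box filter: dot product per row
pool = patches.max(axis=1)           # 3x3 max pooling: max per row
med = np.median(patches, axis=1)     # median filter: median per row
```

The same unrolled matrix serves convolution, pooling, and rank filters alike, which is why many CV and ML kernels beyond convolution benefit from an efficient unroll.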

SLIDE 10

Unrolling: Where and Who?

— Where is the unrolling operation employed?
— Everywhere in optimized parallel computing systems!
— CPU, GPU, DSP, VPU, ASIC
— Who executes unrolling in a system?
— General-purpose processors: the software developers need to handle it
— VPU and ASIC: it is embedded in the hardware for specific applications

SLIDE 11

Problem of Unrolling

  • Main memory
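The figures on this slide are not recoverable; a back-of-envelope sketch (the sizes are illustrative assumptions) shows why a materialized unroll stresses main memory: each input pixel is duplicated into up to k*k overlapping patches.

```python
# One 224x224 float32 feature map, 3x3 kernel, stride 1, no padding.
H = W = 224
k = 3
input_bytes = H * W * 4                # original input in main memory
rows = (H - k + 1) * (W - k + 1)       # one unrolled row per output pixel
unrolled_bytes = rows * k * k * 4      # materialized im2col matrix
blowup = unrolled_bytes / input_bytes  # roughly k*k more memory traffic
```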

SLIDE 12

Unroll is a Fast Blackbox

(Diagram: Main memory → Unroll blackbox → Processors)

SLIDE 13

Efficient Blackbox: Unroll as Late as Possible
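A minimal sketch of "unroll as late as possible" (assumed semantics; the function name is illustrative): each unrolled row is produced immediately before its inner product and discarded right after, so the full unrolled matrix never reaches main memory.

```python
import numpy as np

def conv_on_the_fly(x, w):
    """k*k filtering without materializing the unrolled matrix:
    unroll one row, consume it in an inner product, discard it."""
    k = w.shape[0]
    H, W = x.shape
    wf = w.ravel()
    out = np.empty((H - k + 1, W - k + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i:i + k, j:j + k].ravel() @ wf
    return out

x = np.arange(16, dtype=np.float64).reshape(4, 4)
y_lazy = conv_on_the_fly(x, np.ones((3, 3)) / 9.0)
```

The result matches the materialized im2col path, but peak memory stays at the size of the original input plus one patch.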

SLIDE 14

Naïve Unrolling

SLIDE 15

Unroll at Shared Memory

SLIDE 16

Unroll Upon Computation

SLIDE 17

Useful Unrolling Framework Requires

— Formulation of unrolling
— Build algorithms by unrolling
— DNN
— CV, ML
— …
— Memory-efficient unrolling
— GPUs
— ASICs

SLIDE 18

UMI (Unrolled Memory Inner-Products) Operator

— You simply write code for
— describing the unroll pattern, and
— defining what to do for each row.
— The efficient blackbox makes your code fast.
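The actual UMI API is not shown on these slides; a hypothetical Python analogue of the two-part interface (the `umi_like` name, signature, and offset encoding are all assumptions, not the real CUDA operator) would look like:

```python
import numpy as np

def umi_like(x, pattern, row_op):
    """Hypothetical analogue of the UMI interface: the user supplies
    `pattern` (offsets describing the unroll) and `row_op` (what to do
    with each unrolled row); the framework owns the dataflow."""
    H, W = x.shape
    di = max(p[0] for p in pattern)
    dj = max(p[1] for p in pattern)
    out = np.empty((H - di, W - dj), dtype=x.dtype)
    for i in range(H - di):
        for j in range(W - dj):
            row = np.array([x[i + a, j + b] for a, b in pattern])
            out[i, j] = row_op(row)
    return out

# A 3x3 max filter expressed as a pattern plus a per-row reduction.
pattern = [(a, b) for a in range(3) for b in range(3)]
x = np.arange(16, dtype=np.float64).reshape(4, 4)
mx = umi_like(x, pattern, np.max)
```

Swapping `row_op` (or the pattern) yields a different operator without touching the memory-efficient unroll machinery.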

SLIDE 19

Memory Efficient Unrolling

— Smooth dataflow must consider:
1. DRAM reuse
2. Bank conflicts
— Both can be analyzed by the same formula.
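The slide's analytic formula is an image and is not reproduced here; as a hedged stand-in, a toy simulation of the standard GPU shared-memory model (32 four-byte banks, one warp of 32 threads; an assumption of this sketch) shows how an unroll's access stride determines bank conflicts:

```python
from collections import Counter

BANKS = 32  # standard shared-memory bank count on current NVIDIA GPUs

def max_conflict(word_addresses):
    """Worst-case number of threads in a warp that hit the same bank
    when these word addresses are issued in one cycle."""
    return max(Counter(a % BANKS for a in word_addresses).values())

stride1 = [t * 1 for t in range(32)]    # contiguous access: conflict-free
stride2 = [t * 2 for t in range(32)]    # stride-2 unroll: 2-way conflict
stride32 = [t * 32 for t in range(32)]  # stride-32: fully serialized
```

A conflicted access serializes into that many shared-memory transactions, which is why the unroll's layout in shared memory matters as much as its DRAM reuse.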

SLIDE 20

UMI: Experimental Results

— UMI blackbox
— CUDA version is available on GitHub
— Code reduction: 2--4x
— Speed-up: 1.4--26x
— Hardware implementation is coming soon

Baselines: OpenCV, Parboil, and Caffe

Ref: Y. S. Lin, W. C. Chen and S. Y. Chien, "Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations," ICCV 2017.

SLIDE 21

ASIC Design

— TAU: 32-core parallel processor
— Scales up linearly

SLIDE 22

Conclusion

— AI edge: distributed intelligence
— Memory access optimization is the key to efficient CNN computing
— Unrolling plays an important role in memory optimization and can also benefit other operations
— An unrolling framework, tensor transform for memory-efficient operations, is developed to decouple unrolling operations
— Implementation results: code reduction 2--4x; speed-up 1.4--26x

SLIDE 23

Using UMI Operator is…
