Shao-Yi Chien (簡韶逸) Professor Media IC & System Lab Graduate Institute of Electronics Engineering National Taiwan University
Outline
- AI edge: distributed intelligence
- Tensor transform for memory-efficient operations
- Implementation results
- Conclusion
Internet-of-AI-Things
[Figure: AI, IoT, and Big Data]
Where Should Computing be Located?
- Data from the Internet: big data
- Data from IoT: ultra-big data!
- AI on the cloud? AI on the edge?
[Figure: smart devices (sensors) → aggregators/gateways → cloud servers]
Semantic Level vs. Data from Each Sensor
- Sensor: large data, low semantic level; light-weight learning/recognition engine (data filtering process)
- Aggregator/Gateway: HSA, NPU, DSP, neural processors (context inferring process)
- Cloud: small data, high semantic level; cloud servers with CPU/GPU/FPGA
Distributed Intelligence
AI Edge
Deep Learning Ecosystem
Memory efficiency is the most important target for optimization.
Unroll: Fast and Simple
Formulation of Unrolling
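The slide's equation is not preserved in this text. As an assumption (not necessarily the slide's exact notation), unrolling is usually formulated im2col-style: for an input $X \in \mathbb{R}^{C\times H\times W}$ and a $K\times K$ window, build an unrolled matrix $U$ with one neighborhood per row.

```latex
U_{(yW' + x),\,(cK^2 + iK + j)} = X_{c,\,y+i,\,x+j},
\qquad W' = W - K + 1,
```

Convolution with $N$ filters $F \in \mathbb{R}^{N \times CK^2}$ then becomes a single matrix product $Y = U F^{\top}$.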
Unroll: More than Conv.
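The point of this slide is that the same unroll serves non-convolution operators too: only the per-row operation changes. A minimal numpy sketch (illustrative code, not the authors' implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unroll(x, k):
    """Unroll k x k neighborhoods of a 2-D array into rows (im2col)."""
    win = sliding_window_view(x, (k, k))   # shape (H-k+1, W-k+1, k, k)
    return win.reshape(-1, k * k)          # one neighborhood per row

x = np.arange(16, dtype=np.float32).reshape(4, 4)
rows = unroll(x, 2)                        # shape (9, 4)

# Convolution: a dot product per row.
kernel = np.ones(4, dtype=np.float32) / 4
conv = rows @ kernel                       # 2x2 box-filter output

# Max pooling: a max per row -- same unroll, different row operation.
pool = rows.max(axis=1).reshape(3, 3)
```

The same `rows` matrix feeds both operators; this is the decoupling that the rest of the talk builds on.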
Unrolling: Where and Who?
Where is the unrolling operation employed?
- Everywhere in optimized parallel computing systems: CPU, GPU, DSP, VPU, ASIC

Who executes unrolling in a system?
- General-purpose processors: the software developers need to handle it
- VPU and ASIC: it is embedded in the hardware for specific applications
Problem of Unrolling
[Figure: main-memory traffic of unrolling]
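The problem is data duplication: a naïve unroll materializes every element of each K x K neighborhood as its own row entry, so main-memory volume grows by roughly K². A small back-of-the-envelope sketch (layer shape is an assumption for illustration):

```python
C, H, W, K = 64, 224, 224, 3          # illustrative conv-layer shape
input_bytes = C * H * W * 4           # float32 input resident in main memory

# Naive unrolling stores one row per output position, each row holding
# C * K * K duplicated input elements.
rows = (H - K + 1) * (W - K + 1)
unrolled_bytes = rows * C * K * K * 4

print(input_bytes / 2**20)            # ~12.2 MiB
print(unrolled_bytes / 2**20)         # ~108.3 MiB, roughly K^2 = 9x larger
```

This traffic blow-up, not arithmetic, is what the following slides attack.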
Unroll is a Fast Blackbox
[Figure: unroll blackbox between main memory and the processors]
Efficient Blackbox: Unroll as Late as Possible
Naïve Unrolling
Unroll at Shared Memory
Unroll Upon Computation
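The three strategies above push the duplication later in the memory hierarchy: naïve unrolling expands data in DRAM, the second expands it only in on-chip shared memory, and the third expands neighborhoods only at compute time. An illustrative DRAM-traffic estimate under assumed conditions (K x K unroll, float32 image; numbers are assumptions, not measurements from the talk):

```python
# Approximate DRAM bytes touched per strategy for one unrolled pass.
H, W, K = 224, 224, 3
image = H * W * 4

naive = image + image * K * K   # read input, write K^2-duplicated matrix to DRAM
shared = image                  # duplicate only inside on-chip shared memory
on_compute = image              # expand neighborhoods in registers at compute time

print(naive / image)            # 10.0 -> unrolling in DRAM costs ~K^2 extra traffic
```

The later the unroll happens, the closer DRAM traffic stays to the irreducible one read of the input.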
Useful Unrolling Framework Requires
- Formulation of unrolling: build algorithms (DNN, CV, ML, …) by unrolling
- Memory-efficient unrolling: for GPUs and ASICs
UMI (Unrolled Memory Inner-Products) Operator
You simply write code that
- describes the unroll pattern, and
- defines what to do for each row.
The efficient blackbox makes your code fast.
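That two-part contract (an unroll pattern plus a per-row operation) can be sketched as a Python interface. This is a hypothetical illustration of the programming model only; the actual UMI operator is an optimized CUDA blackbox:

```python
import numpy as np

def umi(x, pattern, row_op):
    """Hypothetical UMI-style operator: 'pattern' yields the flat indices
    that form one row; 'row_op' reduces each gathered row to an output."""
    flat = x.ravel()
    return np.array([row_op(flat[idx]) for idx in pattern(x.shape)])

def conv3x3_pattern(shape):
    """Unroll pattern: flat indices of every 3x3 neighborhood."""
    h, w = shape
    for y in range(h - 2):
        for x0 in range(w - 2):
            yield np.array([(y + i) * w + (x0 + j)
                            for i in range(3) for j in range(3)])

img = np.arange(25, dtype=np.float32).reshape(5, 5)

blur = umi(img, conv3x3_pattern, lambda row: row.mean())   # 3x3 box filter
erode = umi(img, conv3x3_pattern, lambda row: row.min())   # morphological erosion
```

One pattern, many operators: swapping `row_op` changes the algorithm without touching the (black-boxed) memory handling.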
Memory Efficient Unrolling
A smooth dataflow must consider:
1. DRAM reuse
2. Bank conflicts
Both can be analyzed with the same unrolling formula.
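The slide's formula itself is not preserved in this text, but the bank-conflict half of the analysis follows the standard CUDA shared-memory model (32 banks, 4-byte words): two lanes of a warp conflict when they touch different words in the same bank. A generic sketch of that count (not the paper's exact formula):

```python
# Worst-case shared-memory bank conflicts for a strided warp access,
# under the standard CUDA model: 32 banks, one 4-byte word per bank per cycle.
BANKS = 32

def max_conflict(stride_words):
    """Largest number of warp lanes that land in the same bank."""
    banks = [(lane * stride_words) % BANKS for lane in range(32)]
    return max(banks.count(b) for b in set(banks))

print(max_conflict(1))    # 1  -> conflict-free
print(max_conflict(2))    # 2  -> 2-way conflict, access serialized twice
print(max_conflict(32))   # 32 -> all lanes hit one bank, fully serialized
```

Choosing the unroll layout so that consecutive lanes hit consecutive banks is what keeps the shared-memory stage conflict-free.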
UMI: Experimental Results
- UMI blackbox: a CUDA version is available on GitHub
- Code reduction: 2--4x; speed-up: 1.4--26x
- Hardware implementation is coming soon
- Baselines: OpenCV, Parboil, and Caffe
Ref: Y.-S. Lin, W.-C. Chen, and S.-Y. Chien, "Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations," ICCV 2017.
ASIC Design
TAU: a 32-core parallel processor that scales up linearly
Conclusion
- AI edge: distributed intelligence
- Memory access optimization is the key to efficient CNN computing
- Unrolling plays an important role in memory optimization and can also benefit other operations
- An unrolling framework, tensor transform for memory-efficient operations, is developed to decouple unrolling operations
- Implementation results: code reduction 2--4x; speed-up 1.4--26x
Using UMI Operator is…