SLIDE 1

TASO: Optimizing Deep Learning with Automatic Generation of Graph Substitutions

Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken (Stanford University)

SOSP ’19, 12/14/19

SLIDE 2

Current Rule-based DNN Optimizations

[Diagram: a rule-based optimizer applies the rule “fuse conv + relu” to a computation graph of conv3x3, conv1x1, add, and relu operators, producing an optimized graph with fused conv3x3+relu and conv1x1+relu operators.]

SLIDE 3

Current Rule-based DNN Optimizations

Example rules: fuse conv + relu, fuse conv + batch normalization, fuse multiple convs.

TensorFlow currently includes ~200 rules (~53,000 LOC)

SLIDE 4

Limitations of Rule-based Optimizations

“When I turned on XLA (TensorFlow’s graph optimizer), the training speed is about 20% slower. With XLA, my program is almost 2x slower than without XLA.” (user report)

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

SLIDE 5

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules. TensorFlow currently uses ~4K LOC just to optimize convolutions.

SLIDE 6

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules

Performance

Miss subtle optimizations for specific DNNs/hardware

SLIDE 7

Motivating Example

[Diagram: a sequence of substitutions rewrites a graph of Conv3x3, Add, and Relu operators, applying steps such as fuse conv & relu, enlarge convs, fuse convs, fuse conv & add, and split.]

The final graph is 30% faster on V100 but 10% slower on K80.

SLIDE 8

[Diagram: DNN graph optimizations must cover combinations of DNN operators, graph architectures, and hardware backends.]

How should we address the complexity of designing DNN graph optimizations?

SLIDE 9

TASO: Tensor Algebra SuperOptimizer

  • Key idea: replace manually designed graph optimizations with automated generation and verification of graph substitutions for deep learning
  • Less engineering effort: 53,000 LOC of manual graph optimizations in TensorFlow → 1,400 LOC in TASO
  • Better performance: outperforms existing optimizers by up to 2.8x

SLIDE 10

Graph Substitution

[Diagram: source graph: Conv3x3(X, W1) → Y1 and Conv3x3(X, W2) → Y2; target graph: Concat(W1, W2) → W, Conv3x3(X, W) → Y, Split(Y) → Y1, Y2.]
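To make this concrete, a substitution can be sketched as a pair of pattern graphs: a source pattern to match and a functionally equivalent target pattern to splice in. A minimal Python sketch of the substitution above (the class and field names are illustrative, not TASO’s actual data structures):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpNode:
    """One operator in a pattern graph (illustrative, not TASO's real IR)."""
    op_type: str      # e.g. "conv3x3", "concat", "split"
    inputs: tuple     # names of input tensors or upstream op outputs

@dataclass(frozen=True)
class Substitution:
    """A rewrite rule: replace matches of `source` with `target`."""
    source: tuple     # pattern to match in the computation graph
    target: tuple     # functionally equivalent pattern to splice in

# The substitution on this slide: two convolutions that share an input
# become one convolution over concatenated kernels, followed by a split.
fuse_parallel_convs = Substitution(
    source=(OpNode("conv3x3", ("X", "W1")),    # produces Y1
            OpNode("conv3x3", ("X", "W2"))),   # produces Y2
    target=(OpNode("concat",  ("W1", "W2")),   # produces W
            OpNode("conv3x3", ("X", "W")),     # produces Y
            OpNode("split",   ("Y",))),        # produces Y1, Y2
)
```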

SLIDE 11

TASO Workflow

[Diagram: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions → Graph Optimizer.]
SLIDE 12

TASO Workflow

[Diagram: Input Comp. Graph + Verified Substitutions → Search-Based Graph Optimizer → Optimized Comp. Graph.]
SLIDE 13

Key Challenges

  • 1. How to generate potential substitutions? Answer: graph fingerprints.
  • 2. How to verify their correctness? Answer: operator specifications + a theorem prover.
SLIDE 14

Graph Substitution Generator

Enumerate all possible graphs up to a fixed size using available operators

The available operators are those supported by the hardware backend.
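The enumeration itself can be sketched as a breadth-first construction: starting from the input tensors, repeatedly apply every available operator to every combination of already-constructed tensors, up to the size bound. A minimal sketch (the representation is illustrative, not TASO’s):

```python
from itertools import product

def enumerate_graphs(inputs, operators, max_ops):
    """Enumerate all operator graphs with up to max_ops operators (sketch).

    `operators` maps an op name to its arity.  A graph is a list of
    (op, operand_indices) steps; index i refers to the i-th available
    tensor (the inputs first, then earlier operator outputs).
    """
    all_graphs, frontier = [[]], [[]]
    for _ in range(max_ops):
        next_frontier = []
        for g in frontier:
            n = len(inputs) + len(g)          # tensors available so far
            for op, arity in operators.items():
                for args in product(range(n), repeat=arity):
                    next_frontier.append(g + [(op, args)])
        all_graphs += next_frontier
        frontier = next_frontier
    return all_graphs

# e.g. enumerate_graphs(["X", "W1"], {"conv3x3": 2, "relu": 1}, max_ops=2)
# TASO additionally prunes structurally duplicate graphs (canonical forms).
```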
SLIDE 15

Graph Substitution Generator

Enumerating graphs with up to 4 operators yields 66M graphs. Directly testing all pairs for equivalence would require a quadratic number of tests.
SLIDE 16

Graph Substitution Generator

[Diagram: each enumerated graph is evaluated on the same inputs I1…Ik, producing outputs O1…Ok that are hashed into a fingerprint.]

Compute output fingerprints with random input tensors.
SLIDE 17

Graph Substitution Generator

[Diagram: the enumerated graphs are grouped by output fingerprint.]

Pairs of graphs with identical fingerprints are candidate substitutions.
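A fingerprint can be sketched as a hash of a graph’s outputs on shared random inputs: differing fingerprints prove two graphs inequivalent, while a collision makes the pair a candidate that still must be verified. A minimal sketch, assuming each enumerated graph is a Python callable over its input tensors (the helper names are hypothetical, not TASO’s API):

```python
import hashlib
from collections import defaultdict
import numpy as np

def fingerprint(graph_fn, input_shapes, seed=0):
    """Hash the graph's outputs on fixed random inputs (sketch)."""
    rng = np.random.default_rng(seed)  # same seed => same inputs for all graphs
    inputs = [rng.standard_normal(shape).astype(np.float32)
              for shape in input_shapes]
    digest = hashlib.sha256()
    for out in graph_fn(*inputs):
        # Round so float reassociation in equivalent graphs still collides.
        digest.update(np.round(out, decimals=4).tobytes())
    return digest.hexdigest()

def candidate_pairs(graphs, input_shapes):
    """Bucket graphs by fingerprint; only pairs within a bucket are
    candidates, avoiding a quadratic all-pairs comparison."""
    buckets = defaultdict(list)
    for g in graphs:
        buckets[fingerprint(g, input_shapes)].append(g)
    return [(a, b) for bucket in buckets.values()
            for i, a in enumerate(bucket) for b in bucket[i + 1:]]
```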
SLIDE 18

Graph Substitution Generator

TASO generates ~29,000 substitutions by enumerating graphs with up to 4 operators. After applying pruning techniques that eliminate redundancy, 743 substitutions remain.
SLIDE 19

Graph Substitution Verifier

[Diagram: Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions, guided by Operator Specifications.]

Example operator specifications:

  • P1. conv is distributive over concatenation:
      ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2))
  • P2. conv is bilinear
  • …
  • Pn.
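Property P1 can be sanity-checked numerically before any formal reasoning. A quick sketch in PyTorch (not part of TASO), concatenating the kernels along the output-channel axis:

```python
import torch
import torch.nn.functional as F

x  = torch.randn(1, 8, 16, 16)   # NCHW input
w1 = torch.randn(4, 8, 3, 3)     # two sets of 3x3 kernels
w2 = torch.randn(6, 8, 3, 3)

# Conv(x, Concat(w1, w2)) == Concat(Conv(x, w1), Conv(x, w2))
lhs = F.conv2d(x, torch.cat([w1, w2], dim=0))
rhs = torch.cat([F.conv2d(x, w1), F.conv2d(x, w2)], dim=1)
assert torch.allclose(lhs, rhs, atol=1e-5)
```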
SLIDE 20

Verification Workflow

[Diagram: the candidate substitution, with source (Conv(x, w1), Conv(x, w2)) and target Split(Conv(x, Concat(w1, w2))), is passed to the theorem prover together with the operator specifications.]

The prover checks that the negation of the equivalence is unsatisfiable:

∃x, w1, w2. (Conv(x, w1), Conv(x, w2)) ≠ Split(Conv(x, Concat(w1, w2)))

given the operator specifications:

  • P1. ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2))
  • P2. …

UNSAT means no counterexample exists, so the substitution is verified.
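This check can be sketched with the Z3 SMT solver (which TASO’s verifier builds on): tensors become an uninterpreted sort, operators become uninterpreted functions constrained by the specifications, and the solver searches for a counterexample. The encoding below is a simplified illustration, not TASO’s actual one:

```python
from z3 import DeclareSort, Function, Const, ForAll, Or, Solver

T = DeclareSort("Tensor")             # tensors as an uninterpreted sort
Conv   = Function("Conv",   T, T, T)  # Conv(x, w)
Concat = Function("Concat", T, T, T)  # Concat(a, b)
Split0 = Function("Split0", T, T)     # first output of Split
Split1 = Function("Split1", T, T)     # second output of Split

x, w1, w2 = Const("x", T), Const("w1", T), Const("w2", T)
a, b = Const("a", T), Const("b", T)

s = Solver()
# P1: conv distributes over concatenation of kernels.
s.add(ForAll([x, w1, w2],
             Conv(x, Concat(w1, w2)) == Concat(Conv(x, w1), Conv(x, w2))))
# Split inverts concatenation.
s.add(ForAll([a, b], Split0(Concat(a, b)) == a))
s.add(ForAll([a, b], Split1(Concat(a, b)) == b))

# Negation of the substitution: any input on which the two graphs disagree?
s.add(Or(Conv(x, w1) != Split0(Conv(x, Concat(w1, w2))),
         Conv(x, w2) != Split1(Conv(x, Concat(w1, w2)))))
print(s.check())   # "unsat": no counterexample, substitution verified
```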
SLIDE 21

Verification Effort

TASO generates all 743 substitutions in 5 minutes and verifies them against 43 operator properties in 10 minutes. Supporting a new operator requires a few hours of human effort to discover its properties. Operator specifications in TASO ≈ 1,400 LOC; manual graph optimizations in TensorFlow ≈ 53,000 LOC.
SLIDE 22

Search-Based Graph Optimizer [1]

  • Goal: apply verified substitutions to obtain an optimized graph
  • Cost model [2]: the cost of a graph is the sum of its individual operators’ costs, with each operator’s cost measured on the hardware
  • Cost-based backtracking search: backtracks out of locally optimal solutions (see the sketch after the references below)
  • Optimizing a DNN model takes less than 10 minutes

  1. Z. Jia et al. Optimizing DNN Computation with Relaxed Graph Substitutions. SysML ’19.
  2. Z. Jia et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. ICML ’18.
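A minimal sketch of such a cost-based backtracking search, in the spirit of the relaxed substitutions of [1]. The `cost` function and `subst.apply_all` helper are hypothetical; the relaxation factor `alpha` admits graphs slightly worse than the best found so far, which is what lets the search back out of local optima:

```python
import heapq
import itertools

def optimize(graph, substitutions, cost, alpha=1.05):
    """Cost-based backtracking search over verified substitutions (sketch)."""
    tie = itertools.count()                 # tie-breaker for the heap
    best, best_cost = graph, cost(graph)
    queue = [(best_cost, next(tie), graph)]
    seen = {graph}                          # graphs assumed hashable
    while queue:
        c, _, g = heapq.heappop(queue)
        if c > alpha * best_cost:           # prune graphs far above the best
            continue
        for subst in substitutions:
            for g2 in subst.apply_all(g):   # every match site (hypothetical API)
                if g2 in seen:
                    continue
                seen.add(g2)
                c2 = cost(g2)               # sum of measured per-operator costs
                if c2 < best_cost:
                    best, best_cost = g2, c2
                heapq.heappush(queue, (c2, next(tie), g2))
    return best
```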
SLIDE 23

End-to-end Inference Performance (V100 GPU w/ cuDNN)

[Bar chart: end-to-end inference runtime (ms) on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large for TensorFlow, TensorRT, MetaFlow, and TASO w/ cuDNN; annotated speedups: 1.3x, 1.0x, 2.8x, 1.4x, 1.4x.]

TASO is competitive on standard models and delivers larger speedups on emerging models.
SLIDE 24

End-to-end Inference Performance (V100 GPU w/ TVM)

[Bar chart: runtime (ms) of TVM vs. TASO w/ TVM on the same five models; annotated speedups: 1.3x, 1.0x, 1.8x, 1.3x, 1.1x.]

TASO achieves similar speedups on the TVM backend.
SLIDE 25

Heatmap of Used Substitutions

[Heatmap: how many times each substitution is used to optimize each DNN; some heavily used substitutions are not covered in TensorFlow.]

Different DNN models require different substitutions.
SLIDE 26

Conclusion

TASO is the first DNN optimizer that automatically generates graph substitutions.

  • Less engineering effort
  • Better performance
  • Formal verification

https://github.com/jiazhihao/taso

  • Supports DNN models in ONNX, TensorFlow, and PyTorch (see the usage sketch below)
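For reference, optimizing an ONNX model with the released code looks roughly like this (based on the repository’s documented Python interface; check the README for the current API):

```python
import onnx
import taso

graph = taso.load_onnx("model.onnx")      # import a model (e.g. exported
                                          # from TensorFlow or PyTorch)
optimized = taso.optimize(graph)          # search over verified substitutions
onnx.save(taso.export_onnx(optimized), "model_taso.onnx")
```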

SLIDE 27

Scalability Analysis

[Line chart: relative speedup (1x to 3x) vs. maximum graph substitution size (1 to 4 operators) for NasNet-A, ResNeXt-50, and BERT.]
SLIDE 28

Case Study: NASNet

[Diagram: a NASNet cell built from Conv 1x1, DWC 3x3, DWC 5x5, Avg 3x3, Add, and Concat operators, and a substitution TASO applies to it in which parallel Conv 1x1 + DWC chains are merged by concatenating their weights.]

Add: element-wise addition; Conv: standard convolution; DWC: depth-wise convolution.
SLIDE 29

Future Work: Query Optimizations

  • A database query is expressed as a tree of relational operators
  • Query optimizations are tree transformations

SLIDE 30

Contribution

  • Replacing current manually designed graph optimizations with automatic generation of graph substitutions for deep learning
  • Less engineering effort: 53,000 LOC for graph optimizations in TensorFlow → 1,400 LOC
  • Better performance: outperforms existing optimizers by up to 2.8x
  • Correctness: formal verification of graph substitutions
SLIDE 31

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules

Performance

Miss subtle optimizations for specific DNNs/hardware

[Diagram: example substitutions over conv 1x1, conv 3x3, DWC 3x3/5x5, concat, pad, and add operators.]

Some substitutions only apply to specific hardware; others only apply to specialized graph structures.
SLIDE 32

TASO: Tensor Algebra SuperOptimizer

Key idea: automatically generate graph substitutions and verify them

[Diagram: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions.]
SLIDE 33

TASO: Tensor Algebra SuperOptimizer

[Diagram: the full TASO pipeline: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions → Search-Based Graph Optimizer, which turns an Input Comp. Graph into an Optimized Comp. Graph.]
SLIDE 34

End-to-end Inference Performance

[Bar chart: relative speedup of TASO over TensorFlow, TensorRT, and MetaFlow on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large; annotated speedups: 1.3x, 1.0x, 2.8x, 1.4x, 1.4x.]
SLIDE 35

Joint Optimizer for Graph Substitution and Data Layout

  • Motivation: some graph substitutions only improve performance when combined with particular layout transformations
  • Idea: consider potential layout transformations along with graph substitutions (an additional 1.3x speedup)
  • Cost-based backtracking search (see the sketch below): assume the cost to run a model is the sum of its individual operators’ costs, and measure each operator’s cost on the hardware
  • A search takes less than 10 minutes
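Under the same search skeleton as before, layouts can be folded in by making the search state a (graph, layout assignment) pair and treating layout changes as additional moves. A hypothetical sketch of the neighbor generation (the `apply_all` and `operators` helpers are assumptions, not TASO’s API):

```python
def neighbors(state, substitutions, layouts):
    """Yield successor states: graph rewrites and layout changes (sketch)."""
    graph, layout = state
    for subst in substitutions:
        for g2 in subst.apply_all(graph):    # graph move, layout unchanged
            yield (g2, layout)
    for op in graph.operators():             # layout move, graph unchanged
        for fmt in layouts:                  # e.g. "NCHW", "NHWC", blocked
            if fmt != layout.get(op):
                yield (graph, {**layout, op: fmt})
```

Each operator’s measured cost then depends on its assigned layout, so the cost model naturally captures substitutions that only pay off under a particular layout.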