SLIDE 1

TASO: Optimizing Deep Learning with Automatic Generation of Graph Substitutions

Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken (Stanford University)

SOSP ’19, 12/14/19

SLIDE 2

Current Rule-based DNN Optimizations

[Diagram: a rule-based optimizer applies the rule “fuse conv + relu” to a computation graph of conv3x3, conv1x1, add, and relu operators, producing an optimized graph with fused conv3x3+relu and conv1x1+relu operators.]

SLIDE 3

Current Rule-based DNN Optimizations

Example rules: fuse conv + relu, fuse conv + batch normalization, fuse multiple convs.

TensorFlow currently includes ~200 rules (~53,000 LOC)

SLIDE 4

Limitations of Rule-based Optimizations

“When I turned on XLA (TensorFlow’s graph optimizer), the training speed is about 20% slower. With XLA, my program is almost 2x slower than without XLA.” (user report)

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

SLIDE 5

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules. TensorFlow currently uses ~4K LOC just to optimize convolutions.

SLIDE 6

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules

Performance

Miss subtle optimizations for specific DNNs/hardware

SLIDE 7

Motivating Example

[Diagram: a sequence of substitutions rewrites a graph of Conv3x3, Add, and Relu operators, applying steps such as fuse conv & relu, enlarge convs, fuse convs, fuse conv & add, and split.]

The final graph is 30% faster on V100 but 10% slower on K80.

SLIDE 8

[Diagram: DNN graph optimizations must cover combinations of DNN operators, graph architectures, and hardware backends.]

How should we address the complexity of designing DNN graph optimizations?

SLIDE 9

TASO: Tensor Algebra SuperOptimizer

  • Key idea: replace manually designed graph optimizations with automated generation and verification of graph substitutions for deep learning
  • Less engineering effort: 53,000 LOC of manual graph optimizations in TensorFlow → 1,400 LOC in TASO
  • Better performance: outperforms existing optimizers by up to 2.8x

SLIDE 10

Graph Substitution

[Diagram: source graph: Conv3x3(X, W1) → Y1 and Conv3x3(X, W2) → Y2; target graph: Concat(W1, W2) → W, Conv3x3(X, W) → Y, Split(Y) → Y1, Y2.]
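To make this concrete, a substitution can be sketched as a pair of pattern graphs: a source pattern to match and a functionally equivalent target pattern to splice in. A minimal Python sketch of the substitution above (the class and field names are illustrative, not TASO’s actual data structures):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpNode:
    """One operator in a pattern graph (illustrative, not TASO's real IR)."""
    op_type: str      # e.g. "conv3x3", "concat", "split"
    inputs: tuple     # names of input tensors or upstream op outputs

@dataclass(frozen=True)
class Substitution:
    """A rewrite rule: replace matches of `source` with `target`."""
    source: tuple     # pattern to match in the computation graph
    target: tuple     # functionally equivalent pattern to splice in

# The substitution on this slide: two convolutions that share an input
# become one convolution over concatenated kernels, followed by a split.
fuse_parallel_convs = Substitution(
    source=(OpNode("conv3x3", ("X", "W1")),    # produces Y1
            OpNode("conv3x3", ("X", "W2"))),   # produces Y2
    target=(OpNode("concat",  ("W1", "W2")),   # produces W
            OpNode("conv3x3", ("X", "W")),     # produces Y
            OpNode("split",   ("Y",))),        # produces Y1, Y2
)
```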

SLIDE 11

TASO Workflow

[Diagram: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions → Graph Optimizer.]
SLIDE 12

TASO Workflow

[Diagram: Input Comp. Graph + Verified Substitutions → Search-Based Graph Optimizer → Optimized Comp. Graph.]
SLIDE 13

Key Challenges

  • 1. How to generate potential substitutions? Answer: graph fingerprints.
  • 2. How to verify their correctness? Answer: operator specifications + a theorem prover.
SLIDE 14

Graph Substitution Generator

Enumerate all possible graphs up to a fixed size using available operators

The available operators are those supported by the hardware backend.
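The enumeration itself can be sketched as a breadth-first construction: starting from the input tensors, repeatedly apply every available operator to every combination of already-constructed tensors, up to the size bound. A minimal sketch (the representation is illustrative, not TASO’s):

```python
from itertools import product

def enumerate_graphs(inputs, operators, max_ops):
    """Enumerate all operator graphs with up to max_ops operators (sketch).

    `operators` maps an op name to its arity.  A graph is a list of
    (op, operand_indices) steps; index i refers to the i-th available
    tensor (the inputs first, then earlier operator outputs).
    """
    all_graphs, frontier = [[]], [[]]
    for _ in range(max_ops):
        next_frontier = []
        for g in frontier:
            n = len(inputs) + len(g)          # tensors available so far
            for op, arity in operators.items():
                for args in product(range(n), repeat=arity):
                    next_frontier.append(g + [(op, args)])
        all_graphs += next_frontier
        frontier = next_frontier
    return all_graphs

# e.g. enumerate_graphs(["X", "W1"], {"conv3x3": 2, "relu": 1}, max_ops=2)
# TASO additionally prunes structurally duplicate graphs (canonical forms).
```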
SLIDE 15

Graph Substitution Generator

Enumerating graphs with up to 4 operators yields 66M graphs. Directly testing all pairs for equivalence would require a quadratic number of tests.
SLIDE 16

Graph Substitution Generator

[Diagram: each enumerated graph is evaluated on the same inputs I1…Ik, producing outputs O1…Ok that are hashed into a fingerprint.]

Compute output fingerprints with random input tensors.
SLIDE 17

Graph Substitution Generator

[Diagram: the enumerated graphs are grouped by output fingerprint.]

Pairs of graphs with identical fingerprints are candidate substitutions.
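A fingerprint can be sketched as a hash of a graph’s outputs on shared random inputs: differing fingerprints prove two graphs inequivalent, while a collision makes the pair a candidate that still must be verified. A minimal sketch, assuming each enumerated graph is a Python callable over its input tensors (the helper names are hypothetical, not TASO’s API):

```python
import hashlib
from collections import defaultdict
import numpy as np

def fingerprint(graph_fn, input_shapes, seed=0):
    """Hash the graph's outputs on fixed random inputs (sketch)."""
    rng = np.random.default_rng(seed)  # same seed => same inputs for all graphs
    inputs = [rng.standard_normal(shape).astype(np.float32)
              for shape in input_shapes]
    digest = hashlib.sha256()
    for out in graph_fn(*inputs):
        # Round so float reassociation in equivalent graphs still collides.
        digest.update(np.round(out, decimals=4).tobytes())
    return digest.hexdigest()

def candidate_pairs(graphs, input_shapes):
    """Bucket graphs by fingerprint; only pairs within a bucket are
    candidates, avoiding a quadratic all-pairs comparison."""
    buckets = defaultdict(list)
    for g in graphs:
        buckets[fingerprint(g, input_shapes)].append(g)
    return [(a, b) for bucket in buckets.values()
            for i, a in enumerate(bucket) for b in bucket[i + 1:]]
```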
SLIDE 18

Graph Substitution Generator

TASO generates ~29,000 substitutions by enumerating graphs with up to 4 operators. After applying pruning techniques that eliminate redundancy, 743 substitutions remain.
SLIDE 19

Graph Substitution Verifier

[Diagram: Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions, guided by Operator Specifications.]

Example operator specifications:

  • P1. conv is distributive over concatenation:
      ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2))
  • P2. conv is bilinear
  • …
  • Pn.
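Property P1 can be sanity-checked numerically before any formal reasoning. A quick sketch in PyTorch (not part of TASO), concatenating the kernels along the output-channel axis:

```python
import torch
import torch.nn.functional as F

x  = torch.randn(1, 8, 16, 16)   # NCHW input
w1 = torch.randn(4, 8, 3, 3)     # two sets of 3x3 kernels
w2 = torch.randn(6, 8, 3, 3)

# Conv(x, Concat(w1, w2)) == Concat(Conv(x, w1), Conv(x, w2))
lhs = F.conv2d(x, torch.cat([w1, w2], dim=0))
rhs = torch.cat([F.conv2d(x, w1), F.conv2d(x, w2)], dim=1)
assert torch.allclose(lhs, rhs, atol=1e-5)
```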
SLIDE 20

Verification Workflow

[Diagram: the candidate substitution, with source (Conv(x, w1), Conv(x, w2)) and target Split(Conv(x, Concat(w1, w2))), is passed to the theorem prover together with the operator specifications.]

The prover checks that the negation of the equivalence is unsatisfiable:

∃x, w1, w2. (Conv(x, w1), Conv(x, w2)) ≠ Split(Conv(x, Concat(w1, w2)))

given the operator specifications:

  • P1. ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2))
  • P2. …

UNSAT means no counterexample exists, so the substitution is verified.
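This check can be sketched with the Z3 SMT solver (which TASO’s verifier builds on): tensors become an uninterpreted sort, operators become uninterpreted functions constrained by the specifications, and the solver searches for a counterexample. The encoding below is a simplified illustration, not TASO’s actual one:

```python
from z3 import DeclareSort, Function, Const, ForAll, Or, Solver

T = DeclareSort("Tensor")             # tensors as an uninterpreted sort
Conv   = Function("Conv",   T, T, T)  # Conv(x, w)
Concat = Function("Concat", T, T, T)  # Concat(a, b)
Split0 = Function("Split0", T, T)     # first output of Split
Split1 = Function("Split1", T, T)     # second output of Split

x, w1, w2 = Const("x", T), Const("w1", T), Const("w2", T)
a, b = Const("a", T), Const("b", T)

s = Solver()
# P1: conv distributes over concatenation of kernels.
s.add(ForAll([x, w1, w2],
             Conv(x, Concat(w1, w2)) == Concat(Conv(x, w1), Conv(x, w2))))
# Split inverts concatenation.
s.add(ForAll([a, b], Split0(Concat(a, b)) == a))
s.add(ForAll([a, b], Split1(Concat(a, b)) == b))

# Negation of the substitution: any input on which the two graphs disagree?
s.add(Or(Conv(x, w1) != Split0(Conv(x, Concat(w1, w2))),
         Conv(x, w2) != Split1(Conv(x, Concat(w1, w2)))))
print(s.check())   # "unsat": no counterexample, substitution verified
```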
SLIDE 21

Verification Effort

TASO generates all 743 substitutions in 5 minutes and verifies them against 43 operator properties in 10 minutes. Supporting a new operator requires a few hours of human effort to discover its properties. Operator specifications in TASO ≈ 1,400 LOC; manual graph optimizations in TensorFlow ≈ 53,000 LOC.
SLIDE 22

Search-Based Graph Optimizer [1]

  • Goal: apply verified substitutions to obtain an optimized graph
  • Cost model [2]: the cost of a graph is the sum of its individual operators’ costs, with each operator’s cost measured on the hardware
  • Cost-based backtracking search: backtracks out of locally optimal solutions (see the sketch after the references below)
  • Optimizing a DNN model takes less than 10 minutes

  1. Z. Jia et al. Optimizing DNN Computation with Relaxed Graph Substitutions. SysML ’19.
  2. Z. Jia et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. ICML ’18.
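A minimal sketch of such a cost-based backtracking search, in the spirit of the relaxed substitutions of [1]. The `cost` function and `subst.apply_all` helper are hypothetical; the relaxation factor `alpha` admits graphs slightly worse than the best found so far, which is what lets the search back out of local optima:

```python
import heapq
import itertools

def optimize(graph, substitutions, cost, alpha=1.05):
    """Cost-based backtracking search over verified substitutions (sketch)."""
    tie = itertools.count()                 # tie-breaker for the heap
    best, best_cost = graph, cost(graph)
    queue = [(best_cost, next(tie), graph)]
    seen = {graph}                          # graphs assumed hashable
    while queue:
        c, _, g = heapq.heappop(queue)
        if c > alpha * best_cost:           # prune graphs far above the best
            continue
        for subst in substitutions:
            for g2 in subst.apply_all(g):   # every match site (hypothetical API)
                if g2 in seen:
                    continue
                seen.add(g2)
                c2 = cost(g2)               # sum of measured per-operator costs
                if c2 < best_cost:
                    best, best_cost = g2, c2
                heapq.heappush(queue, (c2, next(tie), g2))
    return best
```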
SLIDE 23

End-to-end Inference Performance (V100 GPU w/ cuDNN)

[Bar chart: end-to-end inference runtime (ms) on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large for TensorFlow, TensorRT, MetaFlow, and TASO w/ cuDNN; annotated speedups: 1.3x, 1.0x, 2.8x, 1.4x, 1.4x.]

TASO is competitive on standard models and delivers larger speedups on emerging models.
SLIDE 24

End-to-end Inference Performance (V100 GPU w/ TVM)

[Bar chart: runtime (ms) of TVM vs. TASO w/ TVM on the same five models; annotated speedups: 1.3x, 1.0x, 1.8x, 1.3x, 1.1x.]

TASO achieves similar speedups on the TVM backend.
SLIDE 25

Heatmap of Used Substitutions

[Heatmap: how many times each substitution is used to optimize each DNN; some heavily used substitutions are not covered in TensorFlow.]

Different DNN models require different substitutions.
SLIDE 26

Conclusion

TASO is the first DNN optimizer that automatically generates graph substitutions.

  • Less engineering effort
  • Better performance
  • Formal verification

https://github.com/jiazhihao/taso

  • Supports DNN models in ONNX, TensorFlow, and PyTorch (see the usage sketch below)
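For reference, optimizing an ONNX model with the released code looks roughly like this (based on the repository’s documented Python interface; check the README for the current API):

```python
import onnx
import taso

graph = taso.load_onnx("model.onnx")      # import a model (e.g. exported
                                          # from TensorFlow or PyTorch)
optimized = taso.optimize(graph)          # search over verified substitutions
onnx.save(taso.export_onnx(optimized), "model_taso.onnx")
```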

SLIDE 27

Scalability Analysis

[Line chart: relative speedup (1x to 3x) vs. maximum graph substitution size (1 to 4 operators) for NasNet-A, ResNeXt-50, and BERT.]
SLIDE 28

Case Study: NASNet

[Diagram: a NASNet cell built from Conv 1x1, DWC 3x3, DWC 5x5, Avg 3x3, Add, and Concat operators, and a substitution TASO applies to it in which parallel Conv 1x1 + DWC chains are merged by concatenating their weights.]

Add: element-wise addition; Conv: standard convolution; DWC: depth-wise convolution.
SLIDE 29

Future Work: Query Optimizations

  • A database query is expressed as a tree of relational operators
  • Query optimizations are tree transformations

SLIDE 30

Contribution

  • Replacing current manually designed graph optimizations with automatic generation of graph substitutions for deep learning
  • Less engineering effort: 53,000 LOC for graph optimizations in TensorFlow → 1,400 LOC
  • Better performance: outperforms existing optimizers by up to 2.8x
  • Correctness: formal verification of graph substitutions
SLIDE 31

Limitations of Rule-based Optimizations

Robustness

Experts’ heuristics do not apply to all DNNs/hardware

Scalability

New operators and graph structures require more rules

Performance

Miss subtle optimizations for specific DNNs/hardware

[Diagram: example substitutions over conv 1x1, conv 3x3, DWC 3x3/5x5, concat, pad, and add operators.]

Some substitutions only apply to specific hardware; others only apply to specialized graph structures.
SLIDE 32

TASO: Tensor Algebra SuperOptimizer

Key idea: automatically generate graph substitutions and verify them

[Diagram: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions.]
SLIDE 33

TASO: Tensor Algebra SuperOptimizer

[Diagram: the full TASO pipeline: Operator Specifications → Graph Subst. Generator → Candidate Substitutions → Graph Subst. Verifier → Verified Substitutions → Search-Based Graph Optimizer, which turns an Input Comp. Graph into an Optimized Comp. Graph.]
SLIDE 34

End-to-end Inference Performance

[Bar chart: relative speedup of TASO over TensorFlow, TensorRT, and MetaFlow on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large; annotated speedups: 1.3x, 1.0x, 2.8x, 1.4x, 1.4x.]
SLIDE 35

Joint Optimizer for Graph Substitution and Data Layout

  • Motivation: some graph substitutions only improve performance when combined with particular layout transformations
  • Idea: consider potential layout transformations along with graph substitutions (an additional 1.3x speedup)
  • Cost-based backtracking search (see the sketch below): assume the cost to run a model is the sum of its individual operators’ costs, and measure each operator’s cost on the hardware
  • A search takes less than 10 minutes
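Under the same search skeleton as before, layouts can be folded in by making the search state a (graph, layout assignment) pair and treating layout changes as additional moves. A hypothetical sketch of the neighbor generation (the `apply_all` and `operators` helpers are assumptions, not TASO’s API):

```python
def neighbors(state, substitutions, layouts):
    """Yield successor states: graph rewrites and layout changes (sketch)."""
    graph, layout = state
    for subst in substitutions:
        for g2 in subst.apply_all(graph):    # graph move, layout unchanged
            yield (g2, layout)
    for op in graph.operators():             # layout move, graph unchanged
        for fmt in layouts:                  # e.g. "NCHW", "NHWC", blocked
            if fmt != layout.get(op):
                yield (graph, {**layout, op: fmt})
```

Each operator’s measured cost then depends on its assigned layout, so the cost model naturally captures substitutions that only pay off under a particular layout.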