Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman
Inference: using a trained ML model
Queries are sent to the trained model, which returns predictions. For an image classifier, a query image yields a probability for each class (e.g., cat, dog, bird).
Inference used in latency-sensitive apps
Translation, ranking, and search all depend on inference, so inference must operate with low, predictable latency.
Prediction serving systems: inference in clusters
Inference is deployed through cloud services and open-source systems such as Clipper and TensorFlow Serving. A frontend receives queries, dispatches them to model instances, and returns the resulting predictions.
Slowdowns and failures in inference
Model instances can be slowed by network contention and compute contention, or can fail outright. The system must alleviate the effects of slowdowns and failures to reduce tail latency.
Erasure codes widely deployed in systems
Storage systems encode data units D1 and D2 into a parity unit P for resource-efficient resilience. Communication systems use erasure codes for low-latency recovery from packet loss. Can erasure codes also help systems that compute over data (e.g., serving systems)?
Erasure codes for resilient ML inference
This work: overcome fundamental challenges to use erasure codes for reducing tail latency in machine learning inference, bringing their benefits to inference: more resource-efficient than replication, with low recovery latency.
End goal: erasure-coded prediction serving
The frontend's encoder combines incoming queries into a parity query, which is served by a parity model alongside the original queries. When a prediction is slow or failed, a decoder reconstructs it from the parity model's output and the available predictions.
What does it mean to use erasure codes for ML inference? Why is this hard?
Quick recap of erasure codes
Encoding: from data units D1 and D2, compute a "parity" unit P = D1 + D2.
Decoding: if D2 is unavailable, recover it as D2 = P - D1.
Code parameter k: one parity unit can protect k data units, P = D1 + D2 + ... + Dk.
Benefits: compared to replication, erasure coding provides the same resilience with lower resource overhead.
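A minimal sketch of this additive code in Python (the helper names are illustrative, not from an actual library):

```python
import numpy as np

def encode(data_units):
    # Encoding: parity P = D1 + D2 + ... + Dk
    return sum(data_units)

def decode(parity, available_units):
    # Decoding: recover the one missing unit by subtracting
    # the k-1 available units from the parity
    return parity - sum(available_units)

D1 = np.array([1.0, 2.0])
D2 = np.array([3.0, 4.0])
P = encode([D1, D2])                     # P = D1 + D2
assert np.allclose(decode(P, [D1]), D2)  # D2 = P - D1
```

With k = 2, this stores three units to protect two; 2x replication would store four units for the same single-failure resilience.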
Using erasure codes for inference
Queries X1 and X2 are dispatched to instances of a model F, producing predictions F(X1) and F(X2). Goal: preserve the results of inference over queries.
Encode: combine queries X1 and X2 into a "parity query" P.
Decode: if a prediction such as F(X2) is unavailable, reconstruct it from F(P) and the available prediction F(X1).
Traditional coding vs. codes for inference
Codes for storage reconstruct the data units D1 and D2 themselves. Codes for inference must handle computation over inputs: encoding and decoding must hold over the computation F.
Designing erasure codes for inference is hard
This problem is studied under the theoretical framework of "coded computation." The current practice is to handcraft an erasure code for each computation:
- Straightforward for linear F (see the sketch below)
- Far more challenging for non-linear F: handcrafted codes apply only to restricted function classes (polynomials) and require 2x resource overhead
Current handcrafted coded-computation approaches cannot support neural networks.
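A small sketch of why linearity matters, with a toy matrix multiply standing in for F and a ReLU illustrating the non-linear case (both stand-ins are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))

linear_F = lambda x: W @ x                     # linear computation
nonlinear_F = lambda x: np.maximum(W @ x, 0)   # e.g., a ReLU layer

X1, X2 = rng.standard_normal(8), rng.standard_normal(8)
P = X1 + X2  # addition encoder

# Linear F commutes with the code: F(P) = F(X1) + F(X2),
# so the subtraction decoder recovers F(X2) exactly.
assert np.allclose(linear_F(P) - linear_F(X1), linear_F(X2))

# Non-linearity breaks the identity, so the same decoder fails.
assert not np.allclose(nonlinear_F(P) - nonlinear_F(X1),
                       nonlinear_F(X2))
```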
This work: overcome the challenges of handcrafting erasure codes for coded computation by taking a learning-based approach to erasure-coded resilience.
Learning an erasure code?
One approach: design the encoder and decoder themselves as neural networks. This can be accurate, but it makes the encoder and decoder computationally expensive.
Learn computation over parities: "parity model"
Instead, use simple, fast encoders and decoders, and learn the computation over parities with a "parity model" FP. With an addition encoder, P = X1 + X2, and a subtraction decoder reconstructs an unavailable prediction as F(X2) = FP(P) - F(X1). This approach is accurate while keeping the encoder and decoder efficient.
Designing parity models
Goal: transform parities into a form that enables the decoder to reconstruct unavailable predictions. For the addition encoder and subtraction decoder above, this means the parity model should satisfy FP(P) = F(X1) + F(X2). Rather than handcrafting such a transformation, learn a parity model (a sketch of the resulting pipeline follows).
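A minimal sketch of the encode/infer/decode pipeline, with toy linear networks standing in for the deployed model F and the parity model FP (the stand-ins and helper names are assumptions, not the actual architectures):

```python
import torch
import torch.nn as nn

F = nn.Linear(8, 3)    # deployed model (toy stand-in)
F_P = nn.Linear(8, 3)  # parity model, same architecture as F

def encode(queries):
    # Addition encoder: P = X1 + ... + Xk
    return torch.stack(queries).sum(dim=0)

def decode(parity_pred, available_preds):
    # Subtraction decoder: reconstruct the one unavailable
    # prediction as F_P(P) minus the available predictions
    return parity_pred - torch.stack(available_preds).sum(dim=0)

X1, X2 = torch.randn(8), torch.randn(8)
P = encode([X1, X2])
# If F(X2) is slow or failed, reconstruct it approximately.
# The reconstruction is exact only when F_P(P) = F(X1) + F(X2),
# which is what training the parity model aims for.
approx_F_X2 = decode(F_P(P), [F(X1)])
```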
Training a parity model
For a parity query P = X1 + X2, the desired output is F(X1) + F(X2):
1. Sample inputs and encode them into a parity query
2. Perform inference with the parity model to obtain FP(P)
3. Compute the loss between FP(P) and the desired output
4. Backpropagate the loss
5. Repeat
Over iterations, the parity model's outputs approach the desired sums of predictions (a minimal training-loop sketch follows).
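A minimal PyTorch sketch of this training loop, under stated assumptions: random tensors stand in for training queries, toy linear models stand in for F and FP, and MSE is used as the loss (the actual data, architectures, and loss function are choices made in the paper):

```python
import torch
import torch.nn as nn

k = 2                  # code parameter
F = nn.Linear(8, 3)    # deployed model, frozen during training
F_P = nn.Linear(8, 3)  # parity model being trained
opt = torch.optim.Adam(F_P.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    # 1. Sample k inputs and encode them into a parity query
    xs = [torch.randn(8) for _ in range(k)]
    P = torch.stack(xs).sum(dim=0)
    with torch.no_grad():
        # Desired output: F(X1) + ... + F(Xk)
        target = torch.stack([F(x) for x in xs]).sum(dim=0)
    # 2. Perform inference with the parity model
    out = F_P(P)
    # 3. Compute the loss against the desired output
    loss = loss_fn(out, target)
    # 4. Backpropagate the loss and update the parity model
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5. Repeat
```

Raising k in this sketch gives the higher-parameter case described next.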
Training a parity model: higher parameter k
The same procedure supports a higher code parameter k: for example, P = X1 + X2 + X3 + X4 with desired output F(X1) + F(X2) + F(X3) + F(X4).
Training a parity model: different encoders
The addition encoder is only one choice: encoders and decoders can be specialized to the inference task at hand.
Learning results in approximate reconstructions
Approximate reconstructions are appropriate for machine learning inference:
1. Predictions resulting from inference are themselves approximations
2. Inaccuracy only comes into play when predictions would otherwise be slow or failed
Parity models in action in Clipper
Parity models are implemented atop Clipper: the frontend's encoder produces a parity query, the parity model serves it alongside the deployed models, and a decoder reconstructs slow or failed predictions.
Evaluation
1. How accurate are reconstructions using parity models?
2. By how much can parity models help reduce tail latency?
Accuracy of parity models
Reconstructed outputs only come into play when the original predictions are slow or failed. Example: assuming 10% of predictions are slow, overall accuracy is at most 0.7% lower (the arithmetic is sketched below).
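The arithmetic behind this example, writing a for the deployed model's accuracy and a_r for the accuracy of reconstructed predictions (the implied gap of about 7 percentage points between a and a_r is inferred from the slide's numbers, not stated in it):

```latex
\text{overall accuracy} = 0.9\,a + 0.1\,a_r
\quad\Rightarrow\quad
a - \text{overall accuracy} = 0.1\,(a - a_r) \le 0.1 \times 7\% = 0.7\%
```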
Tail latency reduction
In the presence of resource contention, parity models reduce tail latency by 40% while maintaining the same median latency.
Extensive evaluation in paper
- Accuracy evaluated across different:
  - Encoders
  - Inference tasks (image classification, object localization, speech)
  - Neural network architectures (ResNets, VGG, LeNet, MLP)
  - Code parameters (k = 3, k = 4)
- Tail latency evaluated across different:
  - Inference hardware (GPUs, CPUs)
  - Query arrival rates
  - Batch sizes
  - Levels of load imbalance
  - Amounts of redundancy
  - Baseline approaches
Parity models summary
- Overcome the challenges of handcrafting erasure codes for coded computation through learning-based coded resilience
- Parity models transform parities into a form that enables decoding
- Applicable to many inference tasks and neural network architectures
- Reduce tail latency in the presence of resource contention
- Bring the benefits of erasure codes to ML inference

Code: github.com/Thesys-lab/parity-models
Project: pdl.cmu.edu/MLCodedComputation/