

SLIDES 1-2

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman

SLIDES 3-9

Inference: using a trained ML model

Queries are sent to the model, which returns predictions, e.g., class probabilities: cat 0.15, dog 0.8, bird 0.05.

SLIDES 10-12

Inference used in latency-sensitive apps

translation, ranking, search

Inference must operate with low, predictable latency.

SLIDES 13-20

Prediction serving systems: inference in clusters

Cloud services and open-source systems (Clipper, TensorFlow Serving)

A frontend receives queries, dispatches them to model instances, and returns predictions.

SLIDES 21-24

Slowdowns and failures in inference

Network contention, compute contention, and failures can delay or lose predictions.

Must alleviate the effects of slowdowns and failures to reduce tail latency.

SLIDES 25-28

Erasure codes widely deployed in systems

Storage systems: resource-efficient resilience (data units D1, D2 protected by a parity P)

Communication systems: low-latency packet-loss recovery

What about erasure codes for systems that compute over data (e.g., serving systems)?
SLIDES 29-30

Erasure codes for resilient ML inference

This work: overcome fundamental challenges in order to use erasure codes for reducing tail latency in machine learning inference.

Erasure codes are more resource-efficient than replication and offer low recovery latency; the goal is to bring these benefits to inference.

SLIDES 31-39

End goal: erasure-coded prediction serving

The frontend's Encoder combines queries into a parity query and dispatches it to a parity model alongside the original queries; when a prediction is slow or failed, a Decoder reconstructs it from the remaining predictions and the parity model's output.

SLIDE 40

What does it mean to use erasure codes for ML inference? Why is this hard?

SLIDES 41-46

Quick recap of erasure codes

Encoding: P = D1 + D2 (the "parity")

Decoding: D2 = P - D1
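To make the recap concrete, here is a minimal sketch in Python (illustrative names, not the project's code) of additive parity encoding and decoding; it is written for a general k, anticipating the next slide:

```python
# Minimal sketch of single-parity encoding/decoding over numeric data
# (illustrative names). Any one missing unit is recovered by subtracting
# the surviving units from the parity.

def encode(units):
    """Parity: the sum of the k data units."""
    return sum(units)

def decode(parity, survivors):
    """Recover the one missing unit from the parity and the k-1 survivors."""
    return parity - sum(survivors)

d1, d2 = 7.0, 4.0
p = encode([d1, d2])           # P = D1 + D2 = 11.0
assert decode(p, [d1]) == d2   # D2 = P - D1 = 4.0
```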

SLIDE 47

Quick recap of erasure codes: parameter k

P = D1 + D2 + ... + Dk

SLIDE 48

Quick recap of erasure codes: benefits

Compared with replication, erasure coding provides the same resilience to a single unavailability with lower resource overhead (one parity unit instead of a full copy of every data unit).

SLIDES 49-55

Using erasure codes for inference

Queries X1 and X2 are dispatched to model instances, each running F, producing predictions F(X1) and F(X2).

Goal: preserve the results of inference over the queries.

SLIDE 56

Using erasure codes for inference: encode queries

The encoder combines the queries X1 and X2 into a "parity query" P.

SLIDE 57

Using erasure codes for inference: decode results of inference over queries

If one prediction is unavailable, the decoder reconstructs it from F(P) and the available predictions (e.g., F(X2) from F(P) and F(X1)).

SLIDES 58-59

Traditional coding vs. codes for inference

Codes for storage reconstruct the data itself (D1, D2, P). Codes for inference must handle computation over inputs: encoding and decoding must hold over the computation F.
SLIDES 60-65

Designing erasure codes for inference is hard

Theoretical framework: "coded computation." The current practice is to handcraft an erasure code for the computation F:

  • Straightforward for linear F
  • Far more challenging for non-linear F: existing approaches apply only to restricted classes of functions (polynomials) and require 2x resource overhead (illustrated in the sketch below)

Current handcrafted coded-computation approaches cannot support neural networks.
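The sketch referenced above (an assumed illustration, not from the paper) shows why linearity matters: additive encoding and decoding hold exactly for a linear F but break once a non-linearity such as ReLU is involved.

```python
# Why additive encoding/decoding works for linear F but not non-linear F.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))

def linear_f(x):
    return W @ x                     # linear model: F(x) = Wx

def relu_f(x):
    return np.maximum(W @ x, 0.0)    # one non-linear layer: ReLU(Wx)

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
p = x1 + x2                          # additive parity query

# Linear F commutes with the code, so decoding recovers F(X2) exactly.
assert np.allclose(linear_f(p) - linear_f(x1), linear_f(x2))

# Non-linear F does not: ReLU(W(X1 + X2)) != ReLU(W X1) + ReLU(W X2) in general.
print(np.allclose(relu_f(p) - relu_f(x1), relu_f(x2)))  # almost surely False
```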

SLIDE 66

This work: overcome the challenges of handcrafting erasure codes for coded computation by taking a learning-based approach to erasure-coded resilience.

SLIDES 67-69

Learning an erasure code?

One option: design the encoder and decoder as neural networks. This can be accurate, but the encoder and decoder themselves become computationally expensive.

SLIDES 70-75

Learn computation over parities

Instead, use simple, fast encoders and decoders, and learn the computation over parities with a "parity model" FP:

Encoder: P = X1 + X2
Decoder: F(X2) = FP(P) - F(X1)

This keeps reconstructions accurate while keeping the encoder and decoder efficient.
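A runnable sketch of this serving path for k = 2 (illustrative names; f_parity_ideal below is an idealized stand-in that returns exactly F(X1) + F(X2), which a trained parity model only approximates from P):

```python
# Sketch of the parity-models serving path with additive encoding.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))

def f(x):
    """Deployed model: logits followed by softmax."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def f_parity_ideal(p, x1, x2):
    """Idealized parity model: returns the target FP(P) = F(X1) + F(X2)."""
    return f(x1) + f(x2)

x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
p = x1 + x2                                    # encoder: P = X1 + X2
recovered = f_parity_ideal(p, x1, x2) - f(x1)  # decoder: FP(P) - F(X1)
assert np.allclose(recovered, f(x2))           # F(X2) reconstructed
```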

SLIDES 76-80

Designing parity models

Goal: transform parities into a form that enables the decoder to reconstruct unavailable predictions.

With the additive encoder and decoder above, the desired behavior is FP(P) = F(X1) + F(X2). Rather than handcrafting FP, learn a parity model with this target.

SLIDES 81-89

Training a parity model

Desired output: FP(P) = F(X1) + F(X2), where P = X1 + X2.

  • 1. Sample inputs and encode them into a parity query P
  • 2. Perform inference with the parity model to obtain FP(P)
  • 3. Compute the loss between FP(P) and the desired output F(X1) + F(X2)
  • 4. Backpropagate the loss
  • 5. Repeat (a condensed sketch of this loop follows)
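A condensed training-step sketch (PyTorch; f, parity_model, optimizer, and batch are assumed names, and the squared-error loss is an illustrative choice, not necessarily the repository's):

```python
# Condensed parity-model training step.
import torch

k = 2  # code parameter: number of queries combined into one parity

def train_step(f, parity_model, optimizer, batch):
    xs = torch.chunk(batch, k)                 # k query groups (batch size divisible by k)
    parity = sum(xs)                           # 1. encode: P = X1 + ... + Xk
    with torch.no_grad():
        target = sum(f(x) for x in xs)         #    desired output: F(X1) + ... + F(Xk)
    out = parity_model(parity)                 # 2. inference with the parity model
    loss = torch.nn.functional.mse_loss(out, target)  # 3. compute loss
    optimizer.zero_grad()
    loss.backward()                            # 4. backpropagate
    optimizer.step()
    return loss.item()

# 5. repeat: for batch in loader: train_step(f, parity_model, optimizer, batch)
```

Since the step is written for a general code parameter k, the higher-parameter variant on the next slide changes only the value of k.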
SLIDE 90

Training a parity model: higher parameter k

Can use a higher code parameter k, e.g., P = X1 + X2 + X3 + X4 with desired output F(X1) + F(X2) + F(X3) + F(X4).

SLIDES 91-94

Training a parity model: different encoders

The encoder need not be addition: encoders and decoders can be specialized to the inference task at hand, with the parity model trained against the chosen encoder (see the sketch below).
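In that spirit, here is a hypothetical image-specific encoder (an illustrative assumption, not code from the paper): downsample four image queries and tile them into a single parity image of the original size, instead of adding queries element-wise.

```python
# Hypothetical task-specific encoder for image queries: downsample k = 4
# images and tile them into one parity image of the original size.
import torch
import torch.nn.functional as nnf

def tile_encoder(images):
    """images: list of 4 tensors of shape (C, H, W) -> one (C, H, W) parity image."""
    assert len(images) == 4
    halves = [nnf.interpolate(img.unsqueeze(0), scale_factor=0.5,
                              mode="bilinear", align_corners=False).squeeze(0)
              for img in images]
    top = torch.cat([halves[0], halves[1]], dim=2)      # left | right
    bottom = torch.cat([halves[2], halves[3]], dim=2)
    return torch.cat([top, bottom], dim=1)              # top over bottom

parity = tile_encoder([torch.rand(3, 32, 32) for _ in range(4)])
print(parity.shape)  # torch.Size([3, 32, 32])
```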

SLIDES 95-98

Learning results in approximate reconstructions

This is appropriate for machine learning inference:

  • 1. Predictions resulting from inference are themselves approximations
  • 2. The inaccuracy only comes into play when predictions would otherwise be slow or failed

SLIDE 99

Parity models in action in Clipper

The frontend's Encoder issues a parity query to the parity model, and the Decoder reconstructs slow or failed predictions (see the sketch below).
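A toy, runnable sketch of the frontend logic (asyncio; the names and API are assumptions, not Clipper's interfaces). The toy model is linear, so the model itself acts as an exact parity model here; in the real system FP is learned.

```python
# Frontend masking one slow prediction with a parity query.
import asyncio

async def model(x, delay):
    await asyncio.sleep(delay)
    return 2.0 * x                    # toy linear "model" F(x) = 2x

async def serve(x1, x2):
    p = x1 + x2                       # encoder: parity query
    t1 = asyncio.create_task(model(x1, 0.01))
    t2 = asyncio.create_task(model(x2, 1.0))    # this instance is slow
    tp = asyncio.create_task(model(p, 0.01))    # parity model
    done, pending = await asyncio.wait({t1, t2}, timeout=0.1)
    if t2 in pending:                 # F(X2) missed the latency deadline
        t2.cancel()
        return t1.result(), (await tp) - t1.result()  # decode: FP(P) - F(X1)
    return t1.result(), t2.result()

print(asyncio.run(serve(3.0, 4.0)))  # (6.0, 8.0): F(X2) reconstructed
```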

SLIDE 100

Evaluation

  • 1. How accurate are reconstructions using parity models?
  • 2. By how much can parity models help reduce tail latency?
SLIDES 101-104

Accuracy of parity models

Reconstructed outputs only come into play when the original predictions are slow or failed. Example: assuming 10% of predictions are slow, overall accuracy is at most 0.7% lower.
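The arithmetic behind such a bound: if a fraction s of predictions is served from reconstructions, and a reconstruction is at most d percentage points less accurate than the deployed model's prediction, overall accuracy drops by at most s × d. The slide's figures (s = 10%, at most a 0.7% overall drop) imply d ≤ 7 percentage points for this setting.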

SLIDES 105-106

Tail latency reduction

In the presence of resource contention, parity models reduce tail latency by up to 40% while maintaining the same median latency.

SLIDE 107

Extensive evaluation in paper

  • Evaluate accuracy with different:
  • Encoders
  • Inference tasks (image classification, object localization, speech)
  • Neural network architectures (ResNets, VGG, LeNet, MLP)
  • Code parameters (k = 3, k = 4)
  • Evaluate tail latency with different:
  • Inference hardware (GPUs, CPUs)
  • Query arrival rates
  • Batch sizes
  • Levels of load imbalance
  • Amounts of redundancy
  • Baseline approaches

SLIDES 108-112

Parity models summary

  • Overcome the challenges of handcrafting erasure codes for coded computation through learning-based coded resilience
  • Parity models: transform parities to enable decoding
  • Applicable to many inference tasks and neural network architectures
  • Reduce tail latency in the presence of resource contention
  • Bring the benefits of erasure codes to ML inference

Code: github.com/Thesys-lab/parity-models
Project: pdl.cmu.edu/MLCodedComputation/