Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman
Inference: using a trained ML model
Queries are sent to the trained model, which returns predictions. For an image classifier, a query image yields a probability for each class (e.g., cat, dog, bird).
Inference used in latency-sensitive apps
Translation, ranking, and search all depend on inference, so inference must operate with low, predictable latency.
Prediction serving systems: inference in clusters
Inference is deployed through cloud services and open-source systems such as Clipper and TensorFlow Serving. A frontend receives queries, dispatches them to model instances, and returns the resulting predictions.
Slowdowns and failures in inference
Model instances can be slowed by network contention and compute contention, or can fail outright. The system must alleviate the effects of slowdowns and failures to reduce tail latency.
Erasure codes widely deployed in systems
Storage systems encode data units D1 and D2 into a parity unit P for resource-efficient resilience. Communication systems use erasure codes for low-latency recovery from packet loss. Can erasure codes also help systems that compute over data (e.g., serving systems)?
Erasure codes for resilient ML inference
This work: overcome fundamental challenges to use erasure codes for reducing tail latency in machine learning inference, bringing their benefits to inference: more resource-efficient than replication, with low recovery latency.
End goal: erasure-coded prediction serving
The frontend's encoder combines incoming queries into a parity query, which is served by a parity model alongside the original queries. When a prediction is slow or failed, a decoder reconstructs it from the parity model's output and the available predictions.
What does it mean to use erasure codes for ML inference? Why is this hard?
Quick recap of erasure codes
Encoding: from data units D1 and D2, compute a "parity" unit P = D1 + D2.
Decoding: if D2 is unavailable, recover it as D2 = P - D1.
Code parameter k: one parity unit can protect k data units, P = D1 + D2 + ... + Dk.
Benefits: compared to replication, erasure coding provides the same resilience with lower resource overhead.
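A minimal sketch of this additive code in Python (the helper names are illustrative, not from an actual library):

```python
import numpy as np

def encode(data_units):
    # Encoding: parity P = D1 + D2 + ... + Dk
    return sum(data_units)

def decode(parity, available_units):
    # Decoding: recover the one missing unit by subtracting
    # the k-1 available units from the parity
    return parity - sum(available_units)

D1 = np.array([1.0, 2.0])
D2 = np.array([3.0, 4.0])
P = encode([D1, D2])                     # P = D1 + D2
assert np.allclose(decode(P, [D1]), D2)  # D2 = P - D1
```

With k = 2, this stores three units to protect two; 2x replication would store four units for the same single-failure resilience.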
Using erasure codes for inference
Queries X1 and X2 are dispatched to instances of a model F, producing predictions F(X1) and F(X2). Goal: preserve the results of inference over queries.
Encode: combine queries X1 and X2 into a "parity query" P.
Decode: if a prediction such as F(X2) is unavailable, reconstruct it from F(P) and the available prediction F(X1).
Traditional coding vs. codes for inference
Codes for storage reconstruct the data units D1 and D2 themselves. Codes for inference must handle computation over inputs: encoding and decoding must hold over the computation F.
Designing erasure codes for inference is hard
This problem is studied under the theoretical framework of "coded computation." The current practice is to handcraft an erasure code for each computation:
- Straightforward for linear F (see the sketch below)
- Far more challenging for non-linear F: handcrafted codes apply only to restricted function classes (polynomials) and require 2x resource overhead
Current handcrafted coded-computation approaches cannot support neural networks.
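A small sketch of why linearity matters, with a toy matrix multiply standing in for F and a ReLU illustrating the non-linear case (both stand-ins are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))

linear_F = lambda x: W @ x                     # linear computation
nonlinear_F = lambda x: np.maximum(W @ x, 0)   # e.g., a ReLU layer

X1, X2 = rng.standard_normal(8), rng.standard_normal(8)
P = X1 + X2  # addition encoder

# Linear F commutes with the code: F(P) = F(X1) + F(X2),
# so the subtraction decoder recovers F(X2) exactly.
assert np.allclose(linear_F(P) - linear_F(X1), linear_F(X2))

# Non-linearity breaks the identity, so the same decoder fails.
assert not np.allclose(nonlinear_F(P) - nonlinear_F(X1),
                       nonlinear_F(X2))
```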
This work: overcome the challenges of handcrafting erasure codes for coded computation by taking a learning-based approach to erasure-coded resilience.
Learning an erasure code?
One approach: design the encoder and decoder themselves as neural networks. This can be accurate, but it makes the encoder and decoder computationally expensive.
Learn computation over parities: "parity model"
Instead, use simple, fast encoders and decoders, and learn the computation over parities with a "parity model" FP. With an addition encoder, P = X1 + X2, and a subtraction decoder reconstructs an unavailable prediction as F(X2) = FP(P) - F(X1). This approach is accurate while keeping the encoder and decoder efficient.
Designing parity models
Goal: transform parities into a form that enables the decoder to reconstruct unavailable predictions. For the addition encoder and subtraction decoder above, this means the parity model should satisfy FP(P) = F(X1) + F(X2). Rather than handcrafting such a transformation, learn a parity model (a sketch of the resulting pipeline follows).
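A minimal sketch of the encode/infer/decode pipeline, with toy linear networks standing in for the deployed model F and the parity model FP (the stand-ins and helper names are assumptions, not the actual architectures):

```python
import torch
import torch.nn as nn

F = nn.Linear(8, 3)    # deployed model (toy stand-in)
F_P = nn.Linear(8, 3)  # parity model, same architecture as F

def encode(queries):
    # Addition encoder: P = X1 + ... + Xk
    return torch.stack(queries).sum(dim=0)

def decode(parity_pred, available_preds):
    # Subtraction decoder: reconstruct the one unavailable
    # prediction as F_P(P) minus the available predictions
    return parity_pred - torch.stack(available_preds).sum(dim=0)

X1, X2 = torch.randn(8), torch.randn(8)
P = encode([X1, X2])
# If F(X2) is slow or failed, reconstruct it approximately.
# The reconstruction is exact only when F_P(P) = F(X1) + F(X2),
# which is what training the parity model aims for.
approx_F_X2 = decode(F_P(P), [F(X1)])
```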
Training a parity model
For a parity query P = X1 + X2, the desired output is F(X1) + F(X2):
1. Sample inputs and encode them into a parity query
2. Perform inference with the parity model to obtain FP(P)
3. Compute the loss between FP(P) and the desired output
4. Backpropagate the loss
5. Repeat
Over iterations, the parity model's outputs approach the desired sums of predictions (a minimal training-loop sketch follows).
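A minimal PyTorch sketch of this training loop, under stated assumptions: random tensors stand in for training queries, toy linear models stand in for F and FP, and MSE is used as the loss (the actual data, architectures, and loss function are choices made in the paper):

```python
import torch
import torch.nn as nn

k = 2                  # code parameter
F = nn.Linear(8, 3)    # deployed model, frozen during training
F_P = nn.Linear(8, 3)  # parity model being trained
opt = torch.optim.Adam(F_P.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    # 1. Sample k inputs and encode them into a parity query
    xs = [torch.randn(8) for _ in range(k)]
    P = torch.stack(xs).sum(dim=0)
    with torch.no_grad():
        # Desired output: F(X1) + ... + F(Xk)
        target = torch.stack([F(x) for x in xs]).sum(dim=0)
    # 2. Perform inference with the parity model
    out = F_P(P)
    # 3. Compute the loss against the desired output
    loss = loss_fn(out, target)
    # 4. Backpropagate the loss and update the parity model
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5. Repeat
```

Raising k in this sketch gives the higher-parameter case described next.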
Training a parity model: higher parameter k
The same procedure supports a higher code parameter k: for example, P = X1 + X2 + X3 + X4 with desired output F(X1) + F(X2) + F(X3) + F(X4).
Training a parity model: different encoders
The addition encoder is only one choice: encoders and decoders can be specialized to the inference task at hand.
Learning results in approximate reconstructions
Approximate reconstructions are appropriate for machine learning inference:
1. Predictions resulting from inference are themselves approximations
2. Inaccuracy only comes into play when predictions would otherwise be slow or failed
Parity models in action in Clipper
Parity models are implemented atop Clipper: the frontend's encoder produces a parity query, the parity model serves it alongside the deployed models, and a decoder reconstructs slow or failed predictions.
Evaluation
1. How accurate are reconstructions using parity models?
2. By how much can parity models help reduce tail latency?
Accuracy of parity models
Reconstructed outputs only come into play when the original predictions are slow or failed. Example: assuming 10% of predictions are slow, overall accuracy is at most 0.7% lower (the arithmetic is sketched below).
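The arithmetic behind this example, writing a for the deployed model's accuracy and a_r for the accuracy of reconstructed predictions (the implied gap of about 7 percentage points between a and a_r is inferred from the slide's numbers, not stated in it):

```latex
\text{overall accuracy} = 0.9\,a + 0.1\,a_r
\quad\Rightarrow\quad
a - \text{overall accuracy} = 0.1\,(a - a_r) \le 0.1 \times 7\% = 0.7\%
```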
Tail latency reduction
In the presence of resource contention, parity models reduce tail latency by 40% while maintaining the same median latency.
Extensive evaluation in paper
- Accuracy evaluated across different:
  - Encoders
  - Inference tasks (image classification, object localization, speech)
  - Neural network architectures (ResNets, VGG, LeNet, MLP)
  - Code parameters (k = 3, k = 4)
- Tail latency evaluated across different:
  - Inference hardware (GPUs, CPUs)
  - Query arrival rates
  - Batch sizes
  - Levels of load imbalance
  - Amounts of redundancy
  - Baseline approaches
Parity models summary
- Overcome the challenges of handcrafting erasure codes for coded computation through learning-based coded resilience
- Parity models transform parities into a form that enables decoding
- Applicable to many inference tasks and neural network architectures
- Reduce tail latency in the presence of resource contention
- Bring the benefits of erasure codes to ML inference

Code: github.com/Thesys-lab/parity-models
Project: pdl.cmu.edu/MLCodedComputation/