SLIDE 1 Continuous Distributed Monitoring in the Evolved Packet Core
Industry Experience Report
Romaric Duvignau (1), Marina Papatriantafilou (1), Konstantinos Peratinos (3), Eric Nordström (2), Patrik Nyman (2)
DEBS 2019, Darmstadt (June 26).
(1) Chalmers University of Technology, (2) Ericsson, (3) Chalmers student and Ericsson intern.
SLIDE 2
Introduction
SLIDE 3 Context: Monitoring the Evolved Packet Core (EPC) in 4G
The Evolved Packet Core
[Figure: User Equipment (UE) connects over LTE to a base station; traffic enters the EPC through the EPG, with a Control Plane (CP, PFCP over Sxa/Sxb) and a User Plane (UP, GTP tunnels identified by TEIDs) handled by workers W1, W2, ..., WN; packets exit toward servers in the PDN.]
SLIDE 4 Context: Monitoring the Evolved Packet Core (EPC) in 4G
The Evolved Packet Core
[Figure: same EPC diagram; the Control Plane is annotated with "MME, QoS, billing, ...".]
SLIDE 5 Context: Monitoring the Evolved Packet Core (EPC) in 4G
The Evolved Packet Core
[Figure: same EPC diagram; annotations: Control Plane — "MME, QoS, billing, ..."; EPG — "Packet Gateway".]
SLIDE 6 Context: Monitoring the Evolved Packet Core (EPC) in 4G
The Evolved Packet Core
[Figure: same EPC diagram, with the "MME, QoS, billing, ..." and "Packet Gateway" annotations.]
- Large-scale, distributed, performance-critical system.
- Strong need to continuously monitor the EPC: e.g. detection of under- or over-used subcomponents.
SLIDE 7
Continuous Distributed Monitoring
SLIDE 8
Continuous Distributed Monitoring (CDM) Model
SLIDE 9
Continuous Distributed Monitoring (CDM) Model
f(S1, S2, ..., Sk)
SLIDE 10
Continuous Distributed Monitoring (CDM) Model
f(S1, S2, ..., Sk)
There exist variants (unidirectional, relay nodes, etc.).
SLIDE 11 Continuous Distributed Monitoring (CDM) Model
f(S1, S2, ..., Sk)
There exist variants (unidirectional, relay nodes, etc.).
[Figure: the k sites, each observing a stream Si, exchange communication with a coordinator that continuously tracks f.]
SLIDE 12
System Architecture
SLIDE 13
System Architecture Overview
[Figure: coordinator C connected to aggregators Agg1, Agg2, ...; each aggregator Agg_i manages workers w_1^i, w_2^i, ..., w_l^i; a Load Balancer spreads the Incoming Traffic over the workers.]
SLIDE 14
System Architecture Overview
[Figure: same architecture; the aggregators pull "Fetched Statistics" from their workers.]
SLIDE 15
System Architecture Overview
[Figure: same architecture; the aggregators additionally send "Monitoring Messages" to C.]
SLIDE 16
System Architecture Overview
[Figure: same architecture; C forwards results to a Display for the analysts.]
SLIDE 17 System Architecture Overview
[Figure: same architecture: C, aggregators, workers, Load Balancer, Fetched Statistics, Monitoring Messages, Display (analysts).]
Differences with CDM models
- Site identities matter; performance statistics play the role of "events"; etc.
- Need to account for computation and communication delays!
SLIDE 18
System Architecture Overview
[Figure: same architecture, plus a timeline showing the Monitoring Period.]
SLIDE 19
System Architecture Overview
[Figure: same architecture; the timeline now shows the Fetches within each Monitoring Period.]
SLIDE 20
System Architecture Overview
[Figure: same architecture; the timeline adds the Sliding Window over past periods.]
SLIDE 21
System Architecture Overview
[Figure: same as Slide 20 (animation step).]
SLIDE 22
System Architecture Overview
[Figure: same as Slide 20 (animation step).]
SLIDE 23
System Architecture Overview
[Figure: same architecture and timeline (Monitoring Period, Fetches, Sliding Window).]
→ At the Agg: monitoring decisions, then one monitoring message.
SLIDE 24
Monitoring Algorithms
SLIDE 25 Selected CDM Algorithms for Counting problems
Basic Mode: Exact Monitoring
- Send an update if the last value sent differs from the measured value
- Keep an exact sliding window of the last n values
SLIDE 26 Selected CDM Algorithms for Counting problems
Basic Mode: Exact Monitoring
- Send an update if the last value sent differs from the measured value
- Keep an exact sliding window of the last n values
Approximation Mode: Relative Error of ε
- Uses Exponential Histograms for approximate counting
- Send the approximate count when it is beyond some error bound from the last value sent
- Requires O(log(nε)/ε) words in total
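The two per-value decision rules above can be sketched as follows (illustrative Python, not the authors' implementation; the class names and per-round `observe` interface are ours; the 4ε/9 threshold for the approximate mode follows the detailed rule given in the backup slides):

```python
from collections import deque

class BasicMonitor:
    """Exact mode: resend only when the measured value changed."""
    def __init__(self, n):
        self.window = deque(maxlen=n)  # exact sliding window of last n values
        self.last_sent = None

    def observe(self, value):
        self.window.append(value)
        if value != self.last_sent:    # send an update only on change
            self.last_sent = value
            return value               # update to forward to the coordinator
        return None                    # no message this round

class ApproxMonitor:
    """Approximation mode: resend when the approximate count drifts
    beyond a relative bound from the last value sent."""
    def __init__(self, eps):
        self.eps = eps
        self.last_sent = None

    def observe(self, approx_count):
        c = self.last_sent
        # 4ε/9 threshold; the remaining ε/9 of the error budget is spent
        # inside the approximate (Exponential Histogram) counter itself.
        if c is None or approx_count > (1 + 4 * self.eps / 9) * c \
                     or approx_count < (1 - 4 * self.eps / 9) * c:
            self.last_sent = approx_count
            return approx_count
        return None
```

With ε = 9% the relative threshold is ±4%, so a count moving from 100 to 103 is silent while 105 triggers an update.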
SLIDE 27
Results
SLIDE 28 Experimental setup
- EPG setup: 2 aggregators, 72 workers per aggregator
- 2 phases: increasing load (20min) then stable load (15min)
[Figure: CPU utilization (%) and packet rate (packets/s, up to ~3M) over the run, showing Min, p5, Median, p95, Max across workers.]
SLIDE 29 Experimental setup
- EPG setup: 2 aggregators, 72 workers per aggregator
- 2 phases: increasing load (20min) then stable load (15min)
[Figure: same plots; CPU is fetched 1000 times/s (high precision), packet rate once/s (low precision).]
SLIDE 30
No. of Monitoring Updates per Round
- 5-10% of data sent for packet processing rate; 30-70% for CPU.
[Figure: updates relative to Basic mode (CPU and pkt) for ε = 5%, 10%, 20%, and 5% with window W60.]
SLIDE 31
No. of Monitoring Updates per Round
- 5-10% of data sent for packet processing rate; 30-70% for CPU.
[Figure: same plots; annotation: max relative error close to 5ε/9 and average < ε/5.]
SLIDE 32 Monitoring Availability
- 8 runs (ca. 4h of data) with monitoring round = 1s
[Figure: update time (s, MA300) and availability (MA300) on Agg1/Agg2 for the Basic (B) and ε-approximate variants: 5%, 10%, 20%, B5%, B10%, B20%, B5%W60.]
SLIDE 33
Conclusion
SLIDE 34 Conclusions
- Adjusted state-of-the-art CDM implementations to the EPC
- Keys to popularizing CDM within a production-level system
- From experiments: only 6% of the data sent, for a 1.6% avg error
- Useful for the upcoming transition to the 5G architecture
SLIDE 35
Thank you!
SLIDE 36 Error Analysis
- Max relative error is always close to 5ε/9
- A larger window influences the absolute error on CPU
[Figure: relative error over time for 5% and 5%W60 (Min, p10, Median, p90, Max).]
SLIDE 37 Comparison with Simple Approximation
- Simple Approximation: keep an exact window and send updates when the last count is beyond some predefined relative bound
[Figure: number of updates (pkt) and relative errors (%, pkt) for B, 5%, 10%, 20%, 5%W60.]
- The ε-approximate algorithm presents tradeoffs similar to the simple approximation with bound 5ε/9
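The simple-approximation baseline above can be sketched as follows (illustrative; the class name and `theta` parameter are ours; for the comparison on this slide `theta` would be set to 5ε/9):

```python
from collections import deque

class SimpleApprox:
    """Exact sliding window; update only when the exact count drifts
    past a fixed relative bound theta from the last value sent."""
    def __init__(self, n, theta):
        self.window = deque(maxlen=n)  # exact window of the last n values
        self.theta = theta
        self.last_sent = None

    def observe(self, value):
        self.window.append(value)
        c = sum(self.window)           # exact count over the window
        if self.last_sent is None or \
           abs(c - self.last_sent) > self.theta * self.last_sent:
            self.last_sent = c
            return c                   # update for the coordinator
        return None
```

Unlike the ε-approximate algorithm, this keeps the full window in memory, trading space for an exact local count.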
SLIDE 38 CDM approaches
Simple approaches
- Flooding: does not scale!
- Polling: hard to choose the right polling interval!
- Sampling: does not capture scarce under-/over-used components!
Solutions
- Communication-optimal algorithms
- Geometric Monitoring → efficient network-wide aggregates.
- Tailored algorithms for particular tasks → e.g. computing item frequencies or the most popular items.
- Heuristics → e.g. adaptive filters.
- Compromises: Magpie, Dapper, Ganglia, ...
SLIDE 39 Proposed Monitoring Solutions
[Figure: timeline with Monitoring Period, Fetches, and Sliding Window; the monitoring logic runs for each monitored value.]
- Implemented as part of the aggregator nodes
- Once all fetches have been collected, a monitoring decision is taken on whether to propagate the update
- Aggregation of all monitoring updates: at most a single monitoring message is sent per aggregator
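One aggregator round as described above could look like this (hypothetical sketch; the dict-based worker/monitor interface is ours, not the system's API):

```python
def aggregator_round(fetched, monitors):
    """fetched:  dict worker_id -> value collected this monitoring period
       monitors: dict worker_id -> callable(value) -> update or None"""
    updates = {}
    for wid, value in fetched.items():   # all fetches have been collected
        decision = monitors[wid](value)  # per-value monitoring decision
        if decision is not None:
            updates[wid] = decision
    # all updates are packed into (at most) one message per aggregator
    return updates or None
```

Batching the decisions into one message per round is what bounds the aggregator-to-coordinator traffic regardless of how many values fired.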
SLIDE 40 Selected CDM Algorithms
Basic Mode
- Send an update if the last value sent is different
- Keep an exact sliding window of length n
ε-Approximation Mode
- Keep an (ε/9)-approximate Exponential Histogram for the approximate sum ĉ of items over a sliding window of the last n events
- If ĉ > (1 + 4ε/9)·c or ĉ < (1 − 4ε/9)·c, send an update, where c is the last value sent
- Requires O(log(nε)/ε) words of memory in total
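The Exponential Histogram itself can be sketched as follows (a simplified version of the classic Datar et al. structure for counting ones over a sliding window, not the deck's implementation; parameter and method names are ours, and the deck's setting would use ε/9 as the accuracy parameter):

```python
import math
from collections import deque

class ExpHistogram:
    """Approximate count of ones over the last n events, relative error ~eps."""
    def __init__(self, n, eps):
        self.n = n                       # window length in events
        self.k = math.ceil(1 / eps)      # allow ~k/2+2 buckets per size
        self.buckets = deque()           # (timestamp, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # expire buckets whose timestamp fell out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit:
            self.buckets.appendleft((self.t, 1))
            self._merge()

    def _merge(self):
        # merge the two oldest buckets of a size once there are too many,
        # producing one bucket of twice the size (sizes stay powers of two)
        limit = self.k // 2 + 2
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= limit:
                break
            i, j = idx[-1], idx[-2]      # the two oldest buckets of this size
            ts = self.buckets[j][0]      # keep the newer timestamp
            del self.buckets[i]
            self.buckets[j] = (ts, 2 * size)
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        # the oldest bucket may straddle the window edge: count half of it
        return total - self.buckets[-1][1] // 2
```

The memory is dominated by O(log(nε)/ε) buckets of O(1) words each, matching the bound quoted above.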
SLIDE 41 Measuring Metrics of Interest: 2 modes
With high granularity: CPU usage
- 1. P fetches of CPU usage (each covering the past 1 ms) within one monitoring period
- 2. Frequency chart (histogram of F bins) for the P fetches
- 3. Sliding windows are updated: each bin is monitored
- 4. For each changed (basic) or out-of-bounds (approx) value, a monitoring update is sent
- 5. Upon receiving an update, C updates its frequency counts for the respective observer and CPU bin, and may then display the average CPU over the window as (Σ_{1≤i≤F} i·f_i) / (Σ_{1≤i≤F} f_i)
With low granularity: Packet Processing Rate
- Only the number of processed packets per monitoring period is tracked
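Steps 2 and 5 of the high-granularity pipeline can be sketched as follows (illustrative; the bin-to-value mapping and function names are ours; the average uses 1-indexed bin indices as representative values, as in the displayed formula):

```python
def frequency_chart(samples, F, max_cpu=100.0):
    """Histogram of F equal-width bins over CPU percentages in [0, max_cpu]."""
    f = [0] * F
    for s in samples:
        i = min(int(s / max_cpu * F), F - 1)  # bin index 0..F-1
        f[i] += 1
    return f

def average_from_chart(f):
    """Weighted average of 1-indexed bin indices: Σ i·f_i / Σ f_i."""
    total = sum(f)
    if total == 0:
        return 0.0
    return sum((i + 1) * fi for i, fi in enumerate(f)) / total
```

Only the bin counters (not the raw per-millisecond samples) need to travel to C, which is what makes the high-granularity mode affordable.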