Using Multi-System Monitoring Time Series to Predict Performance Events
Andreas Schörgenhumer, Mario Kahlhofer, Peter Chalupar, Hanspeter Mössenböck, Paul Grünbacher
09.11.2018
Motivation
[Figure: timeline t; an ML model is trained on past data (Train) and then applied to later data (Predict)]
Straightforward:
- Single system
- Single component
- Univariate time series
Motivation
- Multiple, interlinked components
- Multivariate time series
- Event to data connection
- …
- Multiple systems
[Figure: multiple systems → Train → ML]
Approach
[Figure: Multi-System Data + Configs → Preprocessing Framework → CSVs → ML]
(1) Data → (2) Preprocessing → (3) Prediction
(1) Data
- 250 systems, 20-day export, 1-minute resolution
- Entities per system: Service, Host, Disk, Network Interface (linked by 1-to-many relations)
- Service: Events (service slowdowns)
- Host: 11 time series (CPU load, Memory available, SWAP available, …)
- Disk: 13 time series (Available, Read time, Write time, …)
- Network Interface: 10 time series (Bytes received, Bytes sent, Packets dropped, …)
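The entity model above can be pictured as a small data structure. The following is only a sketch: the class names, field names, the service name "checkout", and the flat list of components are assumptions made for illustration, and the real export format is not shown on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A monitored entity (host, disk, or network interface) with its time series."""
    kind: str
    # Maps a time series name to values sampled at 1-minute resolution.
    time_series: dict = field(default_factory=dict)


@dataclass
class Service:
    name: str
    # Minute indices at which a service slowdown event was recorded.
    slowdown_events: list = field(default_factory=list)


@dataclass
class System:
    # Simplification: the slide links the entities to each other,
    # here they are kept as two flat lists per system.
    name: str
    services: list = field(default_factory=list)
    components: list = field(default_factory=list)


# Time series counts per entity type as stated on the slide.
SERIES_PER_KIND = {"Host": 11, "Disk": 13, "Network Interface": 10}
total_series = sum(SERIES_PER_KIND.values())

# Hypothetical example instance.
system = System(
    name="sys1",
    services=[Service("checkout", slowdown_events=[100])],
    components=[Component("Host", {"CPU_LOAD": [0.4, 0.5]})],
)
```

The per-entity counts add up to the 34 time series used later in the prediction step.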
(2) Preprocessing – Framework
- Input: YAMLs (configurations/configs)
  - Contain all necessary data processing settings
  - Easily changeable due to YAML format
- Output: CSVs (feature vectors)
  - Portable format, directly usable for ML
Config (YAML input):

systems:
  - "sys1"
  - "sys2"
timeSeries:
  - CPU_LOAD
from: "2018-01-19 09:00"
to: "2018-02-02 09:00"
...
leadTime: 0
observationWindowsBoxes:
  CPU_LOAD:
    - size: 60
      step: 1
      aggregationFunctions:
        - "AVG"
      combinationFunctions:
        - "AVG"
samplingMode: "PER_EVENT"
missingDataPointMode: "NAN"
addAttributes: true
...

CSV (output feature vectors):

CPU_LOAD:AVG   System   Label
0.95           sys1     Event
0.71           sys2     No event
0.90           sys2     Event
0.87           sys2     No event
0.84           sys1     No event
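How a config entry turns into one value of a feature vector can be sketched as follows. This is a minimal illustration and not the framework's implementation: the function name `window_feature`, the minute-indexed dictionary input, and the reading that the observation window ends `lead_time` minutes before the event are all assumptions.

```python
from statistics import mean

# Hypothetical function table; only the names AVG/MIN/MAX come from the slides.
AGGREGATIONS = {"AVG": mean, "MIN": min, "MAX": max}


def window_feature(series, event_minute, size, lead_time=0, aggregation="AVG"):
    """Aggregate the `size` minutes of `series` that end `lead_time`
    minutes before `event_minute` (assumed window semantics).

    `series` maps a minute index to a value (1-minute resolution).
    Minutes without a data point are skipped in this sketch.
    """
    end = event_minute - lead_time   # exclusive end of the window
    start = end - size               # inclusive start of the window
    values = [series[m] for m in range(start, end) if m in series]
    if not values:
        return float("nan")
    return AGGREGATIONS[aggregation](values)


# Toy CPU load series: one value per minute for two hours.
cpu_load = {minute: 0.5 + 0.004 * minute for minute in range(120)}

# size: 60, leadTime: 0, aggregation AVG, as in the config above.
feature = window_feature(cpu_load, event_minute=100, size=60, lead_time=0)
row = {"CPU_LOAD:AVG": round(feature, 2), "System": "sys1", "Label": "Event"}
```

Each such `row` would correspond to one line of the CSV, with the label telling the ML model whether the window preceded an event.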
(2) Preprocessing – Config Settings

Setting                    Example
Systems                    [sys1, sys2, ...]
Time series                [Host: CPU_LOAD, Disk: AVAILABLE, ...]
Time frame                 From: 19-01-2018 09:00 To: 02-02-2018 09:00
Sampling mode              PER_EVENT, SLIDE_THROUGH
Negative sampling source   NON_EVENT_SERVICES, EVENT_SERVICES, ...
Lead time                  10 min
Observation windows        60 min, AVG aggregation, AVG combination
Missing data mode          DROP, NAN, LAST_VALUE, ...
Metadata                   System, special attributes, ...
...                        ...
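The missing data modes could behave as sketched below. The semantics are assumptions inferred from the mode names alone (DROP, NAN, LAST_VALUE); the slides do not define them, and the function name `fill_window` is hypothetical.

```python
import math


def fill_window(series, start, end, mode="NAN"):
    """Return the values for minutes [start, end) under a missing-data mode.

    Assumed semantics (inferred from the mode names, not from the framework):
      DROP       -> missing minutes are left out
      NAN        -> missing minutes become NaN
      LAST_VALUE -> missing minutes repeat the last seen value
    """
    values = []
    last = math.nan  # nothing seen yet
    for minute in range(start, end):
        if minute in series:
            last = series[minute]
            values.append(last)
        elif mode == "DROP":
            continue
        elif mode == "NAN":
            values.append(math.nan)
        elif mode == "LAST_VALUE":
            values.append(last)
        else:
            raise ValueError(f"unknown missing data mode: {mode}")
    return values


# Minute 2 is missing from this toy series.
memory_available = {0: 4.0, 1: 3.5, 3: 3.0}
dropped = fill_window(memory_available, 0, 4, mode="DROP")
carried = fill_window(memory_available, 0, 4, mode="LAST_VALUE")
with_nan = fill_window(memory_available, 0, 4, mode="NAN")
```

The choice matters downstream: DROP shortens the window, while NAN keeps its length but requires aggregation functions or models that tolerate NaN values.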
(2) Preprocessing – Example

...
samplingMode: "PER_EVENT"
leadTime: 5
observationWindowsBoxes:
  CPU_LOAD:
    - size: 15
      aggregationFunctions:
        - "MIN"
        - "MAX"
      ...
  DISK_WRITE:
    - size: 30
      aggregationFunctions:
        - "AVG"
        - "MIN"
        - "MAX"
        - "STD_DEV"
      combinationFunctions:
        - "AVG"
        - "MIN"
        - "MAX"
        - "AVG"
  BYTES_SENT:
    - size: 5
      aggregationFunctions:
        - "NONE"
      ...
    - size: 30
      aggregationFunctions:
        - "AVG"
      ...
...

[Figure: a Service on a Host with Disk 1, Disk 2, Disk 3, and a Network interface]
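One plausible reading of the DISK_WRITE entry is that each aggregation function is applied to every disk's window separately, and the per-disk results are then merged by the combination function at the same list position. The sketch below follows that reading. The positional pairing, the function name `combined_features`, the column naming scheme, and the use of the population standard deviation for STD_DEV are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical function table keyed by the names used in the config.
# STD_DEV is taken as the population standard deviation here (assumption).
FUNCTIONS = {"AVG": mean, "MIN": min, "MAX": max, "STD_DEV": pstdev}


def combined_features(windows, aggregations, combinations, name):
    """Aggregate each component's window, then combine across components.

    `windows` holds one list of values per component (e.g. one per disk).
    Pairing aggregation i with combination i is an assumed convention.
    """
    features = {}
    for agg, comb in zip(aggregations, combinations):
        per_component = [FUNCTIONS[agg](w) for w in windows]
        features[f"{name}:{agg}:{comb}"] = FUNCTIONS[comb](per_component)
    return features


# Toy write-time windows for three disks attached to one host.
disk_windows = [[1.0, 3.0], [2.0, 2.0], [4.0, 6.0]]
features = combined_features(
    disk_windows,
    aggregations=["AVG", "MIN", "MAX", "STD_DEV"],
    combinations=["AVG", "MIN", "MAX", "AVG"],
    name="DISK_WRITE",
)
```

Combining this way yields a fixed number of features per event regardless of how many disks a host has, which is what allows feature vectors from different systems to share one CSV schema.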
(3) Prediction
- Data: 20 days, 250 systems, 34 time series
- Split: 14d Train / 6d Test (~3400 FVs / ~2100 FVs)
- Lead time: 0
- Windows:
  - Size: [5, 10, 15, 30, 60]
  - Aggregation: min, max, mean, stddev
- ML: Random Forest

Metric      Value
Accuracy    0.81
Recall      0.81
Precision   0.82
FPR         0.15
F1 Score    0.81
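For reference, the five reported metric types follow from the confusion matrix of a binary classifier (event vs. no event) by their standard definitions. The sketch below uses made-up counts purely for illustration; they are not the counts behind the results above, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metric types from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives,
    false negatives, with "event" as the positive class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "fpr": fp / (fp + tn),  # false positive rate
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration only; not the results from the slides.
metrics = classification_metrics(tp=80, fp=20, tn=80, fn=20)
```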
Future Work
- More training and testing:
  - Splits
  - Config settings
  - System generalization
- Other ML models
- Other events