Syslog Processing for Switch Failure Diagnosis and Prediction in Datacenter Networks
Shenglin Zhang, Weibin Meng, Jiahao Bu, Sen Yang, Dan Pei, Ying Liu, Jun (Jim) Xu, Yu Chen, Hui Dong, Xianping Qu, Lei Song
9/21/2017 IWQOS 2017 1
[Figure: datacenter network topology. Labels: Inter-DC Network, Core Router, Access Router, Aggregation Switch, ToR Switch, Server; IDPS, Firewall, VPN, Load balancer; L2, L3, Core]
Intrusion detection and prevention system (IDPS)
Microsoft (C. Guo, et al., SIGCOMM’15)
Baidu
Switch failures are the norm rather than the exception (P. Gill, et al., SIGCOMM’11)
the datacenter of Hosting.com
services including AWS for 1.5 hours
went dark after a switch failure
branch agency are affected for a few hours
Frameworks
Based on analyzing syslogs
Switch ID | Message timestamp | Message type | Detailed message
Switch 1 | Jun 12 19:03:03 2014 | SIF | Interface te-1/1/59, changed state to down
Switch 2 | Jul 15 11:05:07 2015 | OSPF | Neighbour(rid:10.231.0.43, addr:10.231.39.61) on vlan23, changed state from Exchange to Loading
Switch 3 | Jan 12 21:03:01 2016 | %%SLOT | SFP te-1/1/33 is plugged in, vendor: BROCADE, serial number: AAA210383148232
Describe events occurring on switches
Important to failure diagnosis and proactive detection
Extracting events from the detailed message field
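The four fields shown in the example rows can be held in a small record type before any event extraction is attempted. The following is a minimal sketch, assuming the fields arrive already separated as in the table; the names `SyslogRecord`, `make_record`, and `details_by_type` are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SyslogRecord:
    switch_id: str
    timestamp: datetime
    message_type: str
    detail: str


def make_record(switch_id, timestamp, message_type, detail):
    # Timestamp layout copied from the example rows ("Jun 12 19:03:03 2014").
    # Assumption: English month abbreviations (the default C locale).
    parsed = datetime.strptime(timestamp, "%b %d %H:%M:%S %Y")
    return SyslogRecord(switch_id, parsed, message_type, detail)


def details_by_type(records):
    # Group the detailed-message field by message type, since events are
    # extracted from that field.
    grouped = defaultdict(list)
    for record in records:
        grouped[record.message_type].append(record.detail)
    return dict(grouped)


records = [
    make_record("Switch 1", "Jun 12 19:03:03 2014", "SIF",
                "Interface te-1/1/59, changed state to down"),
    make_record("Switch 2", "Jul 15 11:05:07 2015", "OSPF",
                "Neighbour(rid:10.231.0.43, addr:10.231.39.61) on vlan23, "
                "changed state from Exchange to Loading"),
]
grouped = details_by_type(records)
```

Grouping by message type first keeps the later word statistics local to one type, such as “SIF”.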
Syslog Messages Under the Type “SIF”
Syslog Messages Under the Type “SIF” Before A Failure
A template is a combination of words with high frequency
Unstructured texts
Huge amount of syslog messages
training (two years)
Diverse types of syslog messages
Failure diagnosis and prediction
to keep up-to-date
New kinds of syslog messages
firmware upgrades
any existing template
be extracted
Templates should be updated periodically
Template extraction method: not incrementally re-trainable
Template extraction method: incrementally re-trainable, computationally efficient
Existing template extraction methods
Method | Conference | Merits | Drawbacks
Signature Tree | IMC’10 | Accurate | Not incrementally re-trainable
STE | INFOCOM’14 | None | Inaccurate and not incrementally re-trainable
LogSimilarity | CNSM’15 | Learn incrementally | Inaccurate
Our goal
the descending order of support
M
Words | Support
“changed”, “state”, “to” | 8
“Interface”, “Vlan-interface”, “up”, “down” | 4
“vlan20”, “vlan22”, “ae1”, “ae3” | 2
support
➢V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”}
➢V2 = {“changed”, “state”, “to”, “Vlan-interface”, “down”, “vlan22”}
➢V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3” }
➢V4 = {“changed”, “state”, “to”, “Vlan-interface”, “up”, “vlan22”}
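The support counts and the ordered word lists V1 to V4 can be reproduced with a short sketch. Several things here are assumptions rather than content from the slides: the eight input messages are reconstructed from the support table using the message shape of the earlier “SIF” example row, support is taken to be the number of messages containing a word, and ties in support are broken by the word's position in the message. The function names are hypothetical.

```python
from collections import Counter


def tokenize(message):
    # Split on whitespace and drop trailing commas.
    return [word.strip(",") for word in message.split()]


def word_support(messages):
    # Support of a word = number of messages that contain it (assumption).
    support = Counter()
    for message in messages:
        support.update(set(tokenize(message)))
    return support


def ordered_words(message, support):
    # De-duplicate while keeping message order, then sort by descending
    # support. sorted() is stable, so ties keep their message order.
    words = list(dict.fromkeys(tokenize(message)))
    return sorted(words, key=lambda word: -support[word])


# Assumed input: eight messages consistent with the support table.
messages = [
    f"{kind} {name}, changed state to {state}"
    for kind, names in (("Interface", ("ae1", "ae3")),
                        ("Vlan-interface", ("vlan20", "vlan22")))
    for name in names
    for state in ("up", "down")
]

support = word_support(messages)
v1 = ordered_words("Interface ae3, changed state to down", support)
v2 = ordered_words("Vlan-interface vlan22, changed state to down", support)
```

Under these assumptions `v1` and `v2` list the same words, in the same order, as V1 and V2 above.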
[Figure: step-by-step construction of the FT-tree rooted at the message type “SIF”. The word lists V1, V2, V3, … are inserted in turn. The resulting tree is SIF → changed → state → to, branching into Interface and Vlan-interface, then down / up, then the interface names (ae1, ae3, vlan20, vlan22).]
constitute a template
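The construction described for the “SIF” tree, in which each ordered word list is inserted under the message type, can be sketched as a prefix tree built from nested dictionaries. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, any pruning step is omitted, and reading each root-to-leaf path as a candidate template is an assumption. The list `v5` is also hypothetical. It is consistent with the support table but is not listed on the slides, and it only shows that a new list can be added without rebuilding the tree.

```python
def insert(tree, word_list):
    # Walk down from the given node, creating children as needed.
    node = tree
    for word in word_list:
        node = node.setdefault(word, {})
    return tree


def paths(tree, prefix=()):
    # Yield every root-to-leaf path as a tuple of words.
    if not tree:
        yield prefix
    for word, child in tree.items():
        yield from paths(child, prefix + (word,))


v1 = ["changed", "state", "to", "Interface", "down", "ae3"]
v2 = ["changed", "state", "to", "Vlan-interface", "down", "vlan22"]
v3 = ["changed", "state", "to", "Interface", "up", "ae3"]
v4 = ["changed", "state", "to", "Vlan-interface", "up", "vlan22"]

ft_tree = {}
sif = ft_tree.setdefault("SIF", {})  # message type as the root node
for word_list in (v1, v2, v3, v4):
    insert(sif, word_list)

paths_before = list(paths(ft_tree))

# Incremental update: a later word list reuses the existing prefix.
v5 = ["changed", "state", "to", "Interface", "down", "ae1"]  # hypothetical
insert(sif, v5)
paths_after = list(paths(ft_tree))
```

Because the lists are ordered by descending support, the high-support words “changed”, “state”, “to” end up as a single shared prefix, and only the lower-support words create branches.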
A template is a combination
Accurately extract events from syslog messages
Incrementally re-trainable
framework
Method | Precision | Recall | F1 measure
FT-tree | 32.27% | 95.3% | 48.21%
Signature Tree | 32.27% | 95.3% | 48.21%
STE | 9.14% | 99.6% | 16.75%
LogSimilarity | 10.67% | 83.5% | 18.93%
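The F1 column can be checked against the precision and recall columns using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below assumes that this is the definition used for the table. The values are copied from the table as fractions, and the names `f1_score`, `reported`, and `max_gap` are hypothetical.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per method, copied from the table.
reported = {
    "FT-tree": (0.3227, 0.953, 0.4821),
    "Signature Tree": (0.3227, 0.953, 0.4821),
    "STE": (0.0914, 0.996, 0.1675),
    "LogSimilarity": (0.1067, 0.835, 0.1893),
}

# Largest absolute gap between the recomputed and the reported F1.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in reported.values())
```

Recomputing from the rounded percentages agrees with the reported F1 values to within rounding error.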
Method | Training time
FT-tree | 51 mins
Signature Tree | 628 hours
STE | 100 hours
LogSimilarity | 80 mins
Challenges of template extraction
FT-tree
Evaluation
Future work
slzhangsd@gmail.com