slide-1
SLIDE 1

Syslog Processing for Switch Failure Diagnosis and Prediction in Datacenter Networks

Shenglin Zhang, Weibin Meng, Jiahao Bu, Sen Yang, Dan Pei, Ying Liu, Jun (Jim) Xu, Yu Chen, Hui Dong, Xianping Qu, Lei Song

9/21/2017 IWQoS 2017

slide-3
SLIDE 3

Network Devices in Data Center Networks

[Figure: datacenter network topology. Labels: Inter-DC Network, Core Router, Access Router, Aggregation Switch, ToR Switch, Server, IDPS, Firewall, VPN, Load balancer; layers: Core, L3, L2]

  • Switch
    • Top-of-rack switch
    • Aggregation switch
  • Router
    • Access router
    • Core router
  • Middle box
    • Firewall
    • Intrusion detection and prevention system (IDPS)
    • Load balancer
    • VPN
slide-7
SLIDE 7

Scale of Network Devices in Datacenter

Microsoft (C. Guo, et al., SIGCOMM’15)

  • Hundreds of thousands to millions of servers
  • Hundreds of thousands of switches
  • Millions of cables and fibers

Baidu

  • Hundreds of thousands of servers
  • Tens of thousands of switches

Switch failures are the norm rather than the exception (P. Gill, et al., SIGCOMM’11)

  • More than 400 switch failures per year

slide-9
SLIDE 9

Switch Failures Lead to Outages

  • A Cisco switch failure at the datacenter of Hosting.com
  • Affected a number of services, including AWS, for 1.5 hours
  • The datacenter network went dark after a switch failure
  • Almost every executive branch agency was affected for a few hours

slide-10
SLIDE 10

Switch Failure Diagnosis and Proactive Detection

Frameworks

  • SyslogDigest (IMC 2010)
  • Spatio-temporal Factorization (INFOCOM 2014)
  • Proactive Failure Detection (CNSM 2015)

Based on analyzing syslogs


slide-11
SLIDE 11

Syslog Structure

Switch ID | Message timestamp | Message type | Detailed message
Switch 1 | Jun 12 19:03:03 2014 | SIF | Interface te-1/1/59, changed state to down
Switch 2 | Jul 15 11:05:07 2015 | OSPF | Neighbour(rid:10.231.0.43, addr:10.231.39.61) on vlan23, changed state from Exchange to Loading
Switch 3 | Jan 12 21:03:01 2016 | %%SLOT | SFP te-1/1/33 is plugged in, vendor: BROCADE, serial number: AAA210383148232
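
The four fields of a syslog message can be held in a small record type. The sketch below is only an illustration: the names `SyslogRecord` and `make_record` are hypothetical, and the timestamp format string is inferred from the example rows (it assumes English month abbreviations).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SyslogRecord:
    """One syslog message, split into the four fields shown on the slide."""
    switch_id: str
    timestamp: datetime
    message_type: str
    detail: str


def make_record(switch_id, timestamp, message_type, detail):
    # Timestamp layout inferred from the examples, e.g. "Jun 12 19:03:03 2014".
    parsed = datetime.strptime(timestamp, "%b %d %H:%M:%S %Y")
    return SyslogRecord(switch_id, parsed, message_type, detail)


record = make_record(
    "Switch 1",
    "Jun 12 19:03:03 2014",
    "SIF",
    "Interface te-1/1/59, changed state to down",
)
print(record.message_type, record.timestamp.isoformat())
```

Only the `detail` field is free text; the other three fields are already structured, which is why the rest of the talk concentrates on the detailed message.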

slide-12
SLIDE 12

The detailed message field

Describes events occurring on switches

  • Interface up/down
  • Plug in/out of slot
  • DDoS attack
  • Operator log in/out

Important to failure diagnosis and proactive detection

Extracting events from the detailed message field

  • Pre-processing for failure diagnosis
  • Pre-processing for proactive failure detection
slide-13
SLIDE 13

Syslog Messages Under the Type “SIF”

  1. Interface ae3, changed state to down
  2. Vlan-interface vlan22, changed state to down
  3. Interface ae3, changed state to up
  4. Vlan-interface vlan22, changed state to up
  5. Interface ae1, changed state to down
  6. Vlan-interface vlan20, changed state to down
  7. Interface ae1, changed state to up
  8. Vlan-interface vlan20, changed state to up

slide-15
SLIDE 15

Syslog Messages Under the Type “SIF” Before A Failure

  1. Interface *, changed state to down
  2. Vlan-interface *, changed state to down
  3. Interface *, changed state to up
  4. Vlan-interface *, changed state to up

Common practice for syslog pre-processing:

  • Extracting templates from syslog messages
  • Matching syslog messages to templates

A template is a combination of words with high frequency
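
The matching step can be pictured with a small sketch. It takes the four “SIF” templates above and treats each `*` as one variable word. Matching by regular expression is an assumption made for illustration, not necessarily how the authors match messages; `compile_template` and `match_template` are hypothetical names.

```python
import re

TEMPLATES = [
    "Interface *, changed state to down",
    "Vlan-interface *, changed state to down",
    "Interface *, changed state to up",
    "Vlan-interface *, changed state to up",
]

# Assumption: a wildcard stands for exactly one word without spaces or commas.
WILDCARD = r"[^\s,]+"


def compile_template(template):
    # Escape the literal text and put the wildcard pattern where "*" was.
    literal_parts = [re.escape(part) for part in template.split("*")]
    return re.compile(WILDCARD.join(literal_parts))


COMPILED = [compile_template(template) for template in TEMPLATES]


def match_template(message, compiled=COMPILED):
    """Return the index of the first matching template, or None."""
    for index, pattern in enumerate(compiled):
        if pattern.fullmatch(message):
            return index
    return None


print(match_template("Interface ae1, changed state to up"))
print(match_template("Operator admin logged in"))
```

A message that returns `None` matches no existing template, which is the case that makes periodic template updates necessary.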

slide-16
SLIDE 16

Outline

  • Background and Motivation
  • Challenges
  • Key Ideas
  • Results
  • Conclusion


slide-17
SLIDE 17

Challenges

Unstructured texts

Huge amount of syslog messages

  • Tens of millions every day
  • Long period of historical data for training (two years)

Diverse types of syslog messages

  • Operator log in/out
  • Interface up/down
  • Plug in/out of slot

slide-19
SLIDE 19

Templates should be updated periodically

Failure diagnosis and prediction

  • Based on templates
  • Periodically retrained to keep up-to-date

New kinds of syslog messages

  • Due to software or firmware upgrades
  • Cannot be matched to any existing template
  • New templates should be extracted

slide-20
SLIDE 20

Incrementally re-trainable

[Figure: comparison of two template extraction methods. Labels: Not incrementally re-trainable; Incrementally re-trainable; Computationally efficient]

slide-21
SLIDE 21

Existing template extraction methods

Method | Conference | Merits | Drawbacks
Signature Tree | IMC 10 | Accurate | Not incrementally re-trainable
STE | INFOCOM 14 | None | Inaccurate and not incrementally re-trainable
LogSimilarity | CNSM 15 | Learn incrementally | Inaccurate

slide-22
SLIDE 22

Our goal

Accurate, incrementally re-trainable, efficient template extraction method

slide-26
SLIDE 26

Construct FT-tree

  • Support: if a word W appears in a message, the support of W is incremented by one
  • Scan all the messages and order all of the words into a map M in descending order of support

M

Words | Support
“changed”, “state”, “to” | 8
“Interface”, “Vlan-interface”, “up”, “down” | 4
“vlan20”, “vlan22”, “ae1”, “ae3” | 2
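
The support values in M can be reproduced from the eight “SIF” messages shown earlier. This is a minimal sketch under two assumptions: words are split on whitespace with commas treated as punctuation, and a word counts once per message even if it occurs more than once. `tokenize` and `count_support` are hypothetical names.

```python
from collections import Counter

MESSAGES = [
    "Interface ae3, changed state to down",
    "Vlan-interface vlan22, changed state to down",
    "Interface ae3, changed state to up",
    "Vlan-interface vlan22, changed state to up",
    "Interface ae1, changed state to down",
    "Vlan-interface vlan20, changed state to down",
    "Interface ae1, changed state to up",
    "Vlan-interface vlan20, changed state to up",
]


def tokenize(message):
    # Assumption: whitespace-separated words, commas are not part of a word.
    return message.replace(",", " ").split()


def count_support(messages):
    support = Counter()
    for message in messages:
        # set() makes each word count at most once per message.
        support.update(set(tokenize(message)))
    return support


support = count_support(MESSAGES)
for word, count in sorted(support.items(), key=lambda item: (-item[1], item[0])):
    print(word, count)
```

The counts agree with the table: 8 for the three words shared by every message, 4 for the interface kinds and states, and 2 for each interface or VLAN name.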

slide-27
SLIDE 27

Construct FT-tree

  • Order the words in each message in descending order of support
  • Interface ae3, changed state to down

➢ V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”}

  • Vlan-interface vlan22, changed state to down

➢ V2 = {“changed”, “state”, “to”, “Vlan-interface”, “down”, “vlan22”}

  • Interface ae3, changed state to up

➢ V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3”}

  • Vlan-interface vlan22, changed state to up

➢ V4 = {“changed”, “state”, “to”, “Vlan-interface”, “up”, “vlan22”}
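
The reordering step can be sketched as follows, with the support values copied from M. The slides do not say how ties between words of equal support are broken. Alphabetical order is assumed here, which happens to reproduce V1 to V4. `order_words` is a hypothetical name.

```python
# Support values copied from the map M.
SUPPORT = {
    "changed": 8, "state": 8, "to": 8,
    "Interface": 4, "Vlan-interface": 4, "up": 4, "down": 4,
    "vlan20": 2, "vlan22": 2, "ae1": 2, "ae3": 2,
}


def order_words(message, support=SUPPORT):
    words = set(message.replace(",", " ").split())
    # Descending support; ties broken alphabetically (an assumption).
    # Words missing from the map are treated as having support 0.
    return sorted(words, key=lambda word: (-support.get(word, 0), word))


v1 = order_words("Interface ae3, changed state to down")
v2 = order_words("Vlan-interface vlan22, changed state to down")
print(v1)
print(v2)
```

Whatever tie-breaking rule is used, it has to be the same for every message, so that messages sharing frequent words also share a prefix in the tree.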

slide-28
SLIDE 28

Construct FT-tree

[Figure: FT-tree containing only the root node, SIF]

slide-29
SLIDE 29

Construct FT-tree

[Figure: FT-tree after inserting V1: SIF → changed → state → to → Interface → down → ae3]

V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”}

slide-30
SLIDE 30

Construct FT-tree

[Figure: FT-tree before and after inserting V2. V2 shares the prefix SIF → changed → state → to with V1 and adds the branch Vlan-interface → down → vlan22]

V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”}
V2 = {“changed”, “state”, “to”, “Vlan-interface”, “down”, “vlan22”}

slide-31
SLIDE 31

Construct FT-tree

[Figure: FT-tree after inserting V3. The Interface node gains a second branch, up → ae3, next to down → ae3; the Vlan-interface branch is still down → vlan22]

V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3”}

slide-32
SLIDE 32

Construct FT-tree

[Figure: the FT-tree after the remaining messages are inserted. Under SIF → changed → state → to, the Interface and Vlan-interface nodes each branch into down and up; the children of those nodes are ae3 and ae1 under Interface, and vlan22 and vlan20 under Vlan-interface]

V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3”}

slide-35
SLIDE 35

FT-tree Definition

  • The item in the root node is the syslog message type
  • Each node in the tree has one field, word
  • Prune the FT-tree
    • If a parent node has P+ children:
      • The children of this node will be pruned
      • The parent node will become a leaf node
  • Words on the path from each leaf to the root constitute a template
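
Construction, pruning and template extraction can be put together in a short sketch. It is not the authors' implementation and rests on several assumptions: nodes are plain nested dictionaries, ties in support are broken alphabetically, and “P+ children” is read as “at least `max_children` children”. In the eight example messages, every node has at most two children, so a child-count threshold cannot separate variable words from fixed ones. The corpus below is therefore a hypothetical larger one with four interfaces and four VLANs.

```python
from collections import Counter

# Hypothetical corpus: 16 messages in the style of the "SIF" examples.
MESSAGES = [
    f"{kind} {name}, changed state to {state}"
    for kind, names in (
        ("Interface", ("ae1", "ae2", "ae3", "ae4")),
        ("Vlan-interface", ("vlan20", "vlan21", "vlan22", "vlan23")),
    )
    for name in names
    for state in ("down", "up")
]


def tokenize(message):
    return message.replace(",", " ").split()


def build_ft_tree(message_type, messages):
    support = Counter()
    for message in messages:
        support.update(set(tokenize(message)))
    root = {}  # each node maps a child word to that child's node
    for message in messages:
        ordered = sorted(set(tokenize(message)), key=lambda w: (-support[w], w))
        node = root
        for word in ordered:
            node = node.setdefault(word, {})
    return {message_type: root}


def prune(node, max_children):
    for child in node.values():
        if len(child) >= max_children:
            child.clear()  # drop the children; this node becomes a leaf
        else:
            prune(child, max_children)


def templates(node, path=()):
    if not node:  # leaf: the path from the root is one template
        yield path
    for word, child in node.items():
        yield from templates(child, path + (word,))


tree = build_ft_tree("SIF", MESSAGES)
prune(tree, max_children=3)
for template in templates(tree):
    print(" ".join(template))
```

With these inputs the interface and VLAN names are pruned and four templates remain, one for each combination of interface kind and state. Inserting a new message only walks or extends one path in the tree.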

slide-36
SLIDE 36

FT-tree: accurate and incrementally re-trainable

Based on word frequency

A template is a combination of words with high frequency

Accurately extract events from syslog messages

Naturally incrementally built

Incrementally re-trainable

slide-38
SLIDE 38

Evaluation

Dataset

  • Syslogs & failure tickets
  • 2000+ switches
  • 10+ datacenters
  • Two-year period

Benchmark methods

  • Signature Tree (IMC 10)
  • STE (INFOCOM 14)
  • LogSimilarity (CNSM 15)

slide-39
SLIDE 39

Evaluation

  • Compare accuracy
  • Based on manual labels by operators
  • Four types of syslog messages

slide-40
SLIDE 40

Evaluation

  • Compare failure prediction accuracy
  • Hidden Semi-Markov Model (HSMM) as the failure prediction framework
  • 10-fold cross validation

Method | Precision | Recall | F1 measure
FT-tree | 32.27% | 95.3% | 48.21%
Signature Tree | 32.27% | 95.3% | 48.21%
STE | 9.14% | 99.6% | 16.75%
LogSimilarity | 10.67% | 83.5% | 18.93%
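
The F1 column can be checked against the precision and recall columns, assuming the standard definition F1 = 2PR / (P + R). The numbers below are copied from the table; `f1_score` is a hypothetical helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the table.
REPORTED = {
    "FT-tree": (32.27, 95.3, 48.21),
    "Signature Tree": (32.27, 95.3, 48.21),
    "STE": (9.14, 99.6, 16.75),
    "LogSimilarity": (10.67, 83.5, 18.93),
}

for method, (precision, recall, reported_f1) in REPORTED.items():
    computed = f1_score(precision, recall)
    print(f"{method}: computed {computed:.2f}%, reported {reported_f1:.2f}%")
```

The computed values agree with the reported ones to within 0.02 percentage points. A gap that small is consistent with the precision and recall figures having been rounded before publication.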

slide-41
SLIDE 41

Evaluation

  • Compare computational efficiency
  • 10 million syslog messages per day
  • Retrained every day to match new syslog messages
  • The same type of CPU core

Method | FT-tree | Signature Tree | STE | LogSimilarity
Training time | 51 mins | 628 hours | 100 hours | 80 mins
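
Since the templates are retrained every day, a training run is only practical if it finishes within a day. The sketch below converts the reported times to minutes and compares them with a 24-hour budget. The budget and the helper `fits_daily_retraining` are assumptions for illustration; the times come from the table.

```python
MINUTES_PER_DAY = 24 * 60

# Training times from the table, converted to minutes.
TRAINING_MINUTES = {
    "FT-tree": 51,
    "Signature Tree": 628 * 60,
    "STE": 100 * 60,
    "LogSimilarity": 80,
}


def fits_daily_retraining(minutes, budget=MINUTES_PER_DAY):
    # Assumption: one run on one CPU core must finish before the next day starts.
    return minutes < budget


for method, minutes in TRAINING_MINUTES.items():
    ratio = minutes / TRAINING_MINUTES["FT-tree"]
    verdict = "fits" if fits_daily_retraining(minutes) else "does not fit"
    print(f"{method}: {minutes} min, {ratio:.1f}x FT-tree, {verdict} in one day")
```

Under this assumption only FT-tree and LogSimilarity can keep up with daily retraining on a single core.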

slide-43
SLIDE 43

Conclusion

Challenges of template extraction

  • Unstructured texts
  • Huge amount of syslogs
  • Diverse types of syslogs

FT-tree

  • Accurately extract events from syslogs
  • Incrementally re-trainable

Evaluation

  • Real-world data

Future work

  • Switch failure prediction

slide-44
SLIDE 44

Thank you!

Q&A

slzhangsd@gmail.com

