LIFEGUARD: Practical Repair of Persistent Route Failures
Ethan Katz-Bassett (USC), Colin Scott, David Choffnes, Italo Cunha, Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy
This work is generously funded

Long Outages Cause Most Unavailability
- Monitor outages from Amazon’s EC2
- Fraction of outages of duration ≤ X?
- Fraction of unavailability due to outages of duration ≤ X?
- 86% of outages last less than 5 minutes
- But longer outages account for 90% of the unavailability
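
The two fractions on this slide can differ sharply because a few long outages dominate total downtime. The sketch below computes both from a list of outage durations. The function name and the sample durations are hypothetical and are not the EC2 measurements.

```python
def outage_fractions(durations_min, threshold_min):
    """Return (fraction of outages, fraction of total downtime) contributed
    by outages lasting at most threshold_min minutes."""
    if not durations_min:
        raise ValueError("need at least one outage")
    short = [d for d in durations_min if d <= threshold_min]
    frac_outages = len(short) / len(durations_min)
    frac_unavailability = sum(short) / sum(durations_min)
    return frac_outages, frac_unavailability


# Hypothetical durations in minutes, chosen only to illustrate the effect.
durations = [1, 1, 2, 2, 3, 3, 4, 4, 120, 360]
frac_n, frac_t = outage_fractions(durations, threshold_min=5)
print(f"{frac_n:.0%} of outages, {frac_t:.0%} of downtime")
```

With this toy data most outages are short, yet nearly all downtime comes from the two long ones.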

Operators Struggle to Locate Failures
“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.” (Outages mailing list, Dec. 2010)
Mailing List User 1
1 Home router
2 Verizon in Baltimore
3 Verizon in Philly
4 Alter.net in DC
5 Level3 in DC
6 * * *
7 * * *
Mailing List User 2
1 Home router
2 Verizon in DC
3 Alter.net in DC
4 Level3 in DC
5 Level3 in Chicago
6 Level3 in Denver
7 * * *
8 * * *

Reasons for Long-Lasting Outages
Long-term outages are:
- Repaired over slow, human timescales
- Not well understood
- Caused by routers advertising paths that do not work
  - E.g., corrupted memory on line card causes black hole
  - E.g., bad cross-layer interactions cause failed MPLS tunnel
- Complicated by lack of visibility into or control over routes in other ISPs

Our Approach and Outline
LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
- Locate the ISP / link causing the problem
  - Building blocks
  - Example
  - Description of technique
- Suggest that other ISPs reroute around the problem

Building blocks for failure isolation
LIFEGUARD can use:
- Ping to test reachability
- Traceroute to measure forward path
- Distributed vantage points (VPs)
  - PlanetLab for our experiments
  - Some can source spoof
- Reverse traceroute to measure reverse path (NSDI ’10)
- Atlas of historical forward/reverse paths between VPs and targets

How does LIFEGUARD locate a failure?
Before outage:
- Historical atlas enables reasoning about changes
- Traceroute yields only path from GMU to target
- Reverse traceroute reveals path asymmetry
[Figure: paths between source GMU and target Smartkom through Level3, Telia, TransTelecom, ZSTTK, Rostelecom, and NTT]

How does LIFEGUARD locate a failure?
During outage:
[Figure: source GMU, target Smartkom, and a vantage point (VP), with the networks Level3, Telia, ZSTTK, Rostelecom, NTT, and TransTelecom between them]
- Problem with ZSTTK?
- Probe via VP: "Ping? Fr:VP", answered by "Ping! To:VP"
  - Forward path works
- Probe of NTT: "NTT: Ping? Fr:GMU", answered by "GMU: Ping! Fr:NTT"
- Probe of Rostelecom: "Rostelecom: Ping? Fr:GMU"
  - Rostelecom is not forwarding traffic towards GMU

How LIFEGUARD Locates Failures
LIFEGUARD:
1. Maintains background historical atlas
2. Isolates direction of failure, measures working direction
3. Tests historical paths in failing direction in order to prune candidate failure locations
4. Locates failure as being at the horizon of reachability
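
The "horizon of reachability" step can be sketched as a walk along a historical path in the failing direction, stopping at the first hop that can no longer reach the source. This is a minimal illustration, not LIFEGUARD's implementation: the function name, the hop ordering, and the reachability table are all hypothetical stand-ins for real measurements such as spoofed pings.

```python
from typing import Callable, Optional, Sequence


def horizon_of_reachability(
    hops_from_source: Sequence[str],
    can_reach_source: Callable[[str], bool],
) -> Optional[str]:
    """Walk a historical path outward from the source and return the first
    hop that cannot reach the source, or None if every hop still can."""
    for hop in hops_from_source:
        if not can_reach_source(hop):
            return hop
    return None


# Hypothetical hop order and probe results, loosely following the GMU example.
historical_reverse_hops = ["NTT", "Rostelecom", "ZSTTK", "Smartkom"]
probe_results = {"NTT": True, "Rostelecom": False, "ZSTTK": False, "Smartkom": False}

suspect = horizon_of_reachability(historical_reverse_hops, lambda h: probe_results[h])
print(suspect)
```

In this toy run the last hop that still reaches the source is NTT, so the suspected failure location is the next hop beyond it.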

Our Approach and Outline
LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
- Locate the ISP / link causing the problem
- Suggest that other ISPs reroute around the problem
  - What would we like to add to BGP to enable this?
  - What can we deploy today, using only available protocols and router support?

Our Goal for Failure Avoidance
- Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them
Setting
- Assume we can locate problem
- Assume we are multi-homed / have multiple data centers
- Assume we speak BGP
- We use BGP-Mux to speak BGP to the real Internet: 5 US universities as providers

Self-Repair of Forward Paths
Straightforward: Choose a path that avoids the problem.

A Mechanism for Failure Avoidance
Forward path: Choose route that avoids ISP or ISP-ISP link
Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
- Want a BGP announcement AVOID(X,P):
  - Any ISP with a route to P that avoids X uses such a route
  - Any ISP not using X need only pass on the announcement
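
The intended effect of AVOID(X,P) on a single ISP's route choice can be sketched as follows. AVOID is a hypothetical primitive, not an existing BGP feature, and the names `traverses` and `select_route` are illustrative. Falling back to the most preferred route when nothing avoids X is an assumption of this sketch; the slide does not specify that case.

```python
from typing import Optional, Sequence, Tuple, Union

Path = Tuple[str, ...]
Target = Union[str, Tuple[str, str]]  # an ISP name, or an ISP-ISP link


def traverses(path: Path, x: Target) -> bool:
    """True if the AS path uses ISP x, or the link x when x is a pair."""
    if isinstance(x, tuple):
        return any({a, b} == set(x) for a, b in zip(path, path[1:]))
    return x in path


def select_route(candidates: Sequence[Path], avoid: Optional[Target] = None) -> Optional[Path]:
    """candidates are in preference order. Under AVOID, use the most
    preferred route that avoids the target, if one exists."""
    if not candidates:
        return None
    if avoid is not None:
        for path in candidates:
            if not traverses(path, avoid):
                return path
    return candidates[0]  # assumption: keep the old route if no alternative


# Candidate routes to prefix WS, modeled on the talk's running example.
uw_candidates = [("L3", "ATT", "WS"), ("Sprint", "Qwest", "WS")]
print(select_route(uw_candidates))
print(select_route(uw_candidates, avoid="L3"))
```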

Ideal Self-Repair of Reverse Paths
[Figure: AVOID(L3,WS) announcements appear one after another across the network]

Do paths exist that AVOID problem?
LIFEGUARD repairs outages by instructing others to avoid particular routes.
Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.
- Simulated 10 million AVOIDs on actual measured routes.

Practical Self-Repair of Reverse Paths
Routes to WS before poisoning:
- WS
- ATT → WS
- Qwest → WS
- Sprint → Qwest → WS
- AISP → Qwest → WS
- L3 → ATT → WS
- UW → L3 → ATT → WS
In place of AVOID(L3,WS), WS announces WS → L3 → WS.
BGP loop prevention encourages switch to working path.
Routes to WS after poisoning:
- WS → L3 → WS
- ATT → WS → L3 → WS
- Qwest → WS → L3 → WS
- Sprint → Qwest → WS → L3 → WS
- AISP → Qwest → WS → L3 → WS
- L3: ? (the route L3 → ATT → WS is no longer shown)
- UW → Sprint → Qwest → WS → L3 → WS
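
The poisoning trick relies on standard BGP loop prevention: a router discards any route whose AS path already contains its own AS. The sketch below replays the example with that one rule. Choosing the shortest usable path is a simplification of the real BGP decision process, and the helper names are hypothetical.

```python
def loop_free(asn, as_path):
    """BGP loop prevention: reject a path that already contains our own AS."""
    return asn not in as_path


def choose(asn, offers):
    """Pick the shortest loop-free offer, or None if every offer is rejected.
    (Assumption: shortest path wins; real BGP applies policy first.)"""
    usable = [p for p in offers if loop_free(asn, p)]
    return min(usable, key=len) if usable else None


# WS inserts L3 into its own announcement to poison it.
poisoned = ("WS", "L3", "WS")
att_route = ("ATT",) + poisoned
sprint_route = ("Sprint", "Qwest") + poisoned

# L3 hears the route only via ATT and must reject it.
l3_route = choose("L3", [att_route])

# UW hears from Sprint, and from L3 only if L3 still has a route.
uw_offers = [sprint_route]
if l3_route is not None:
    uw_offers.append(("L3",) + l3_route)
uw_route = choose("UW", uw_offers)

print(l3_route)
print(uw_route)
```

Paths here are shown as received from the neighbor, so UW's own AS is not yet prepended.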

Stuff I Don’t Have Time to Talk About
Results from real poisonings
- Poisoning in the wild / poisoning anomalies
- Case study of restoring connectivity
Making poisoning flexible
- Monitoring broken path while it is disabled
- Allowing ISPs w/o alternatives to use disabled route
LIFEGUARD’s scalability
- Overhead and speed of failure location
- Router update load if many ISPs deploy our approach
Alternatives to poisoning
- Compatibility with secure routing (BGPSEC, etc.)
- Comparing to other route control mechanisms

Can poisoning approximate AVOID effects?
LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) Under certain circumstances, we can disable a link without disabling the full ISP.
(b) We can speed BGP convergence by carefully crafting announcements.

What if some routes in an ISP still work?
[Figure: topology with O, B1, B2, A, C1, C2, C3, C4, D1, and D2. Legend: network link, transitive link, pre-poisoning (original) path, post-poisoning (new) path. Announcement labels shown across the builds: O, O-A-O, and O-O-O.]
- We only want C3 to change its route, to avoid A-B2
- Forward direction is easy: choose a different route
- Poisoning seems blunt, disabling an entire ISP
- Selective advertising via just D1 is also blunt
- If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1
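
Selective poisoning amounts to sending a different AS path to each provider. The sketch below builds those per-provider announcements. Mapping the slide's O-O-O label to D1 and O-A-O to D2 is this sketch's reading of the figure, and the function name is hypothetical.

```python
def selective_announcements(origin, providers, poisoned_as, poison_via):
    """Return {provider: AS path}. Providers in poison_via receive the
    poisoned path; the rest receive a prepended path of the same length."""
    announcements = {}
    for provider in providers:
        middle = poisoned_as if provider in poison_via else origin
        announcements[provider] = (origin, middle, origin)
    return announcements


paths = selective_announcements("O", ["D1", "D2"], poisoned_as="A", poison_via={"D2"})
for provider, as_path in paths.items():
    print(provider, "-".join(as_path))
```

Only the copies of the route that reach A through D2 contain A's AS number, so only those are dropped by A's loop prevention.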

Can poisoning approximate AVOID effects?
LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
- Real-world results from 5 provider BGP-Mux testbed
(b) We can speed BGP convergence by carefully crafting announcements.

Naive Poisoning Causes Transient Loss
- Some ISPs may have working paths that avoid problem ISP X
- Naively, poisoning causes path exploration even for these ISPs
- Path exploration causes transient loss
[Figure: origin O with ISPs A, B, C, D, E, and F. Before poisoning, O announces O and the routes shown are A-O, B-A-O, D-A-O, E-D-A-O, and F-B-A-O. For AVOID(X,P), O announces O-X-O. Intermediate builds show A-O-X-O, B-A-O-X-O, and D-A-O-X-O alongside the older E-D-A-O and F-B-A-O. The final build shows A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, and F-B-A-O-X-O.]

Prepend to Reduce Path Exploration
- Most routing decisions based on: (1) next hop ISP (2) path length
- Keep these fixed to speed convergence
- Prepending prepares ISPs for later poison
[Figure: origin O with ISPs A, B, C, D, E, and F. O first announces the prepended path O-O-O, and the routes shown are A-O-O-O, B-A-O-O-O, D-A-O-O-O, E-D-A-O-O-O, and F-B-A-O-O-O. For AVOID(X,P), O then announces O-X-O, and the routes become A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, and F-B-A-O-X-O.]
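
Why prepending helps can be checked with a small comparison. If a peer ranks routes by next-hop ISP and path length, as the slide states, then swapping O-O-O for O-X-O leaves both inputs unchanged, whereas swapping O for O-X-O changes the length. The ranking key below is that simplification, not the full BGP decision process, and the function names are hypothetical.

```python
def route_key(as_path):
    """Simplified ranking inputs from the slide: (next-hop ISP, path length)."""
    return (as_path[0], len(as_path))


def unchanged_for_peer(before, after):
    """True if the poison leaves the peer's ranking inputs untouched."""
    return route_key(before) == route_key(after)


# Routes as a neighbor of A would receive them.
naive_before = ("A", "O")
naive_after = ("A", "O", "X", "O")
prepend_before = ("A", "O", "O", "O")
prepend_after = ("A", "O", "X", "O")

print(unchanged_for_peer(naive_before, naive_after))
print(unchanged_for_peer(prepend_before, prepend_after))
```

The first comparison reports a change in the ranking inputs and the second does not, which matches the slide's point that keeping next hop and length fixed avoids unnecessary path exploration.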

Prepending Speeds Convergence
[Figure: CDF of peer convergence time. x-axis: Peer Convergence Time (minutes); y-axis: Cumulative Fraction of Convergences (CDF); series: "Prepend, no change" and "No prepend, no change"]
- With no prepend, only 65% of unaffected ISPs converge instantly
- With prepending, 95% of unaffected ISPs re-converge instantly, 98% in under half a minute
- Also speeds convergence to new paths for affected peers

Conclusion
- We increasingly depend on the Internet, but availability lags
- Much of Internet unavailability due to long-lasting outages
- LIFEGUARD: Let edge networks reroute around failures
- Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
  - Use reverse traceroute, isolate directions, use historical view
- Avoidance challenge: Reroute without participation of transit networks
  - BGP poisoning gives control to the destination
  - Well-crafted announcements ease concerns