The Unexpected Responsiveness of Internet Hosts (Neil Spring)



slide-1
SLIDE 1

The Unexpected Responsiveness of Internet Hosts

Neil Spring

slide-2
SLIDE 2

Me

  • Measure the Internet to evaluate and justify protocols that increase network reliability.
  • Thesis work: measuring how routers are connected in practice to evaluate and enhance routing protocols in terms of how they exploit common network designs in routing around failures.
  • Recent work: measuring when residential links fail to determine how people and protocols should respond to faults.

slide-3
SLIDE 3

Residential Link Reliability

  • Residential links are:
  • Important: VoIP/911, security cameras, thermostats
  • Vulnerable: exposed to weather, loss of power, singly-connected

photo credit: Ode Street Tribune

slide-4
SLIDE 4

It’s Personal

slide-5
SLIDE 5

What I mean by “how … to respond to faults”

  • Small-scale individual questions:
  • Should I get more than one provider? Or change?
  • Is it just me?
  • System builder questions:
  • Would it help to coordinate with neighbors for mutual backup?
  • What fraction of errors can “Network Diagnostics” diagnose?
  • Policy questions:
  • Do cities with more buried wiring fare better or worse?
  • How does Maryland compare to Virginia, North America to Europe?
slide-6
SLIDE 6

How to detect network failures

  • “ping” is the fundamental tool.
  • Innocuous packets that have only one purpose (excuse me, are you alive?)
  • A response shows that the recipient is reachable and alive.

slide-7
SLIDE 7

No response ⇏ failure

  • IP service allows four bad things to happen to your packets: delay, duplication, corruption, and loss.
  • A lost echo request (are you there?) or reply (I sure am!) can be expected 1-3% of the time even without a major failure.
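To see why one missing reply is weak evidence of a failure, here is a minimal sketch. It assumes each loss is independent, which is a simplification not taken from the talk (real loss is often bursty). Under that assumption, a live host with per-ping loss rate p misses k pings in a row with probability p^k. The function name is illustrative.

```python
def prob_all_lost(loss_rate: float, k: int) -> float:
    """Probability that k consecutive pings to a live host are all lost.

    Assumes independent losses (an assumption for illustration only).
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return loss_rate ** k


# Compare one lost ping with runs of two and three at a 3% loss rate.
for k in (1, 2, 3):
    print(k, prob_all_lost(0.03, k))
```

A single loss at 3% is routine, while a run of several losses is much harder to explain without a real fault, which is why outage detection looks at sequences of probes rather than one.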

slide-8
SLIDE 8

ThunderPing

  • 1. Watch for severe weather alert forecasts
  • 2. Ping addresses thought to be in that region before, during, and after the alert
  • 3. Figure out if there actually was weather, correlate failures with conditions

☀ 0.3%   ☁ 0.4%   ⛅ 0.3%   ⛈ 2.0%   ⛄ 3.0%
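Step 3 (correlating failures with conditions) can be sketched as a per-condition failure fraction. This is only an illustration: the record layout, the condition labels, and the sample data are made up, and the talk's own metric may be defined differently.

```python
from collections import defaultdict


def failure_rate_by_condition(observations):
    """Return {condition: fraction of observations that were failures}.

    `observations` is an iterable of (condition, failed) pairs, a
    hypothetical record layout chosen for this sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for condition, failed in observations:
        totals[condition] += 1
        if failed:
            failures[condition] += 1
    return {c: failures[c] / totals[c] for c in totals}


# Made-up sample: four observations in clear weather, two in a thunderstorm.
sample = [
    ("clear", False), ("clear", False), ("clear", False), ("clear", True),
    ("thunderstorm", True), ("thunderstorm", False),
]
rates = failure_rate_by_condition(sample)
print(rates)
```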

slide-9
SLIDE 9

Lost pings ⇒ outages

[Figure: RTT (ms) vs. time (hours), 10 vantage points]

slide-10
SLIDE 10
slide-11
SLIDE 11

Failures in weather

[Figure: UP ➡ DOWN rate relative to total rate, by weather condition (Clear, Cloudy, Fog, Rain, T-storm), for Charter, Comcast, Cox, Ameritech, CenturyLink, MegaPath, Speakeasy, Windstream, Verizon DSL, WildBlue, Verizon FiOS]

slide-12
SLIDE 12

Some lost pings ⇒ ??

[Figure: RTT (ms) vs. time (hours), 10 vantage points; host state labeled UP, ???, UP]

slide-13
SLIDE 13

Two Questions

  • Could high delay create false outages?
  • Could renumbering cause false outages and alter their duration?

slide-14
SLIDE 14

When should pings time out?

slide-15
SLIDE 15

When should pings time out?

Measurement platform         Timeout (seconds)
RIPE Atlas                   1
Scamper                      2 (configurable)
Hubble / iPlane              2 (one retry)
SamKnows                     3
Scriptroute / Thunderping    3 (configurable)
ISI survey                   3 (collects all)

slide-16
SLIDE 16

Let’s confirm ~3s!

  • Dataset: ISI survey data: 1% of routed /24’s, pinged every 11 minutes.
  • Precise timing below 3s timeout.
  • Imprecise timing above 3s timeout. Any received echo reply is logged with time and source.
  • Approach: Look at all response times, including those longer than the timeout.

slide-17
SLIDE 17

Survey-detected RTTs

About 10% of addresses routinely respond after one second. The distribution appears clipped by the 3s limit.

[Figure: fraction of addresses vs. RTT (latency, seconds), one curve per percentile of pings: median, 80th, 90th, 95th, 98th, 99th]

slide-18
SLIDE 18

Transform Survey Data

Annotated fields: Probe Destination, Reply Source, RTT, Time

Survey log as recorded:

[1320291701.0] P v119 1.99.16.242 1.99.16.242 2960.995 45
[1320292364.0] P v119 1.99.16.242 1.99.16.242 2767.092 45
[1320293027.0] P v119 1.99.16.242 error_time_out
[1320293031.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d004]
[1320293691.0] P v119 1.99.16.242 error_time_out
[1320293696.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d005]
[1320294354.0] P v119 1.99.16.242 error_time_out
[1320294358.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d004]
[1320295017.0] P v119 1.99.16.242 error_time_out
[1320295030.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d013]

After matching each late reply to its timed-out probe:

[1320291701.0] P v119 1.99.16.242 1.99.16.242 2960.995 45
[1320292364.0] P v119 1.99.16.242 1.99.16.242 2767.092 45
[1320293027.0] P v119 1.99.16.242 1.99.16.242 4000.0000 45
[1320293691.0] P v119 1.99.16.242 1.99.16.242 5000.0000 45
[1320294354.0] P v119 1.99.16.242 1.99.16.242 4000.0000 45
[1320295017.0] P v119 1.99.16.242 1.99.16.242 13000.0000 45
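The transformation on this slide can be sketched as below. The field meanings are inferred from the sample lines, and the matching rule (pair an unmatched reply with the most recent timed-out probe to the same address) is an assumption rather than the survey's documented procedure. The RTT of a late reply is the gap between the two timestamps, in milliseconds.

```python
import re

LINE = re.compile(r"\[(\d+(?:\.\d+)?)\]\s+(.*)")


def transform(lines):
    """Fold late ("no_probe_ip") replies back into their timed-out probes."""
    records = []
    pending = {}  # address -> index in `records` of its latest timed-out probe
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        t = float(m.group(1))
        fields = m.group(2).split()  # e.g. P v119 <dest> <src or error> ...
        third, fourth = fields[2], fields[3]
        if fourth == "error_time_out":
            pending[third] = len(records)
            records.append({"time": t, "dest": third, "src": None, "rtt_ms": None})
        elif third == "no_probe_ip":
            # Unmatched reply: `fourth` is the address that answered.
            idx = pending.pop(fourth, None)
            if idx is not None:
                records[idx]["src"] = fourth
                records[idx]["rtt_ms"] = (t - records[idx]["time"]) * 1000.0
        else:
            records.append({"time": t, "dest": third, "src": fourth,
                            "rtt_ms": float(fields[4])})
    return records


SURVEY_LOG = """
[1320291701.0] P v119 1.99.16.242 1.99.16.242 2960.995 45
[1320292364.0] P v119 1.99.16.242 1.99.16.242 2767.092 45
[1320293027.0] P v119 1.99.16.242 error_time_out
[1320293031.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d004]
[1320293691.0] P v119 1.99.16.242 error_time_out
[1320293696.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d005]
[1320294354.0] P v119 1.99.16.242 error_time_out
[1320294358.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d004]
[1320295017.0] P v119 1.99.16.242 error_time_out
[1320295030.0] P v119 no_probe_ip 1.99.16.242 0.000 45 [d013]
""".strip().splitlines()

for rec in transform(SURVEY_LOG):
    print(rec["time"], rec["dest"], rec["src"], rec["rtt_ms"])
```

Because the timestamps above the timeout have one-second resolution, the recovered RTTs are whole seconds, which matches the "imprecise timing above 3s" caveat.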

slide-19
SLIDE 19

Absurdly long RTTs

Unexpected responses caused by broadcast and duplicate responses

[Figure: RTT distribution by percentile of pings]

slide-20
SLIDE 20

Filtering broadcast responses removes modes

1% of pings from 1% of addresses have RTTs > 145s

[Figure: fraction of addresses (0.98 to 1.0) vs. RTT (latency, seconds, up to 600), one curve per percentile of pings: median, 80th, 90th, 95th, 98th, 99th]

slide-21
SLIDE 21

When should probes time out?

RTT in seconds, by percentile of addresses (rows) and percentile of pings (columns):

                  % of pings
% of addresses    1%     50%    80%    90%    95%    98%    99%
1%                0.01   0.03   0.04   0.07   0.10   0.13   0.18
50%               0.16   0.19   0.21   0.26   0.42   0.53   0.64
80%               0.19   0.26   0.33   0.43   0.54   0.74   1.21
90%               0.22   0.31   0.42   0.57   0.84   1.61   3
95%               0.25   1.42   2.38   3      5      9      15
98%               0.30   1.94   4      6      12     41     78
99%               0.33   2.31   4      8      22     76     145

  • 99% of pings from 99% of addrs have RTTs < 145s
  • Most addresses can respond within 350ms
  • 3s timeout misses 10% of pings from 5% of addrs
  • 5s timeout misses 5% of pings from 5% of addrs
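A table like this can be read as a percentile of percentiles: first take a percentile of each address's RTTs, then take a percentile of those values across addresses. The sketch below uses the nearest-rank definition, which is an assumption (the talk does not say which definition was used), and the sample RTTs are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def timeout_covering(rtts_by_addr, ping_pct, addr_pct):
    """Smallest timeout that captures `ping_pct` percent of pings
    from `addr_pct` percent of addresses."""
    per_addr = [percentile(rtts, ping_pct) for rtts in rtts_by_addr.values()]
    return percentile(per_addr, addr_pct)


# Made-up RTTs in seconds for three hypothetical addresses.
rtts_by_addr = {
    "a": [0.1, 0.2, 0.3, 0.4],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [0.05, 0.05, 0.05, 0.05],
}
print(timeout_covering(rtts_by_addr, ping_pct=50, addr_pct=50))
print(timeout_covering(rtts_by_addr, ping_pct=50, addr_pct=100))
```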

slide-22
SLIDE 22

Inconceivable!

  • Is it an unrepresentative sample?
  • Is it temporary?
  • Is it just ICMP (the protocol used by ping)?
  • Is this new?
  • What addresses take so long to respond?
slide-23
SLIDE 23

Did we sample bad addresses?

Ping them all. More than 150,000 addresses had RTTs > 100s.

slide-24
SLIDE 24

Is it temporary?

slide-25
SLIDE 25

Is it just ICMP?

  • Use Scamper to send TCP, UDP, ICMP probes to high-latency addresses
  • “high-latency”: ~5K addresses from ISI 2015 surveys whose 50th, 80th, 90th or 95th percentile RTTs are in the top 5%
  • Sent ICMP, UDP, TCP packets 20 mins apart, for 36 hours
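One way to read the "high-latency" definition above is as a union of top-5% sets, one per percentile level. The sketch below makes that reading concrete. The ranking rule, the nearest-rank percentile, and the sample data are all assumptions for illustration, not the selection code used for the study.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def high_latency(rtts_by_addr, levels=(50, 80, 90, 95), top_pct=5):
    """Addresses ranked in the top `top_pct` percent at any percentile level."""
    n_top = max(1, math.ceil(top_pct * len(rtts_by_addr) / 100))
    selected = set()
    for level in levels:
        ranked = sorted(
            rtts_by_addr,
            key=lambda addr: percentile(rtts_by_addr[addr], level),
            reverse=True,
        )
        selected.update(ranked[:n_top])
    return selected


# Made-up data: 19 steady hosts plus one that is usually fast but
# occasionally very slow.
rtts = {f"steady{i}": [0.1 * (i + 1)] * 20 for i in range(19)}
rtts["spiky"] = [0.01] * 18 + [100.0] * 2
print(sorted(high_latency(rtts)))
```

Checking several percentile levels catches both hosts that are slow all the time and hosts that are slow only in the tail, which a single level would miss.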

slide-26
SLIDE 26

Is it just ICMP?

[Figure: fraction of addresses vs. max RTT per address (s, log scale, 0.1 to 100), for ICMP, UDP, TCP]

Mode for TCP: likely caused by a firewall. UDP and ICMP are identical.

slide-27
SLIDE 27

Is it just ICMP?

Removed ~500 addresses belonging to firewalling AS

[Figure: fraction of addresses vs. max RTT per address (s, log scale, 0.1 to 100), for ICMP, UDP, TCP]

slide-28
SLIDE 28

Is this new?

[Figure: curves for the 95th, 98th, and 99th percentile of pings]

slide-29
SLIDE 29

Is this new?

[Figure: curves for the 95th, 98th, and 99th percentile of pings]

slide-30
SLIDE 30

What addresses take so long? 1/2: Where?

July 2015 high RTT addresses

Continent        Number    % per continent
South America    8.05M     26.9
Asia             4.56M     3.2
Europe           2.32M     2.4
Africa           1.30M     31.7
North America    1.14M     1.2
Oceania          0.08M     3.7

slide-31
SLIDE 31

What addresses take so long? 2/2: Which providers?

All cellular; majority of responsive addresses.

July 2015 high RTT addresses

Autonomous System     Number    % per AS
Telefonica Brasil     4.20M     77
Tim Celular S.A.      1.72M     71.6
Bharti Airtel Ltd.    1.03M     79.2
Cellco Partnership    0.63M     72.7
Tele2                 0.58M     67.4

slide-32
SLIDE 32

Lessons

  • Pings reach cell phones; may use power, expose activity.
  • Duration of buffering across disconnection is extraordinary, violates TTL and MSL.
  • Long timeouts necessary to disambiguate outages from disconnection.
slide-33
SLIDE 33

Two Questions

  • Could high delay create false outages?
  • Could renumbering cause false outages and alter their duration?

slide-34
SLIDE 34
What’s Renumbering

  • “Dynamic” addresses may change because:
  • The administrator needs to reassign devices to networks
  • A long outage allows the network to forget
  • A rebooted machine gets a new address
  • The provider limits the lifetime of addresses

slide-35
SLIDE 35

Data: RIPE Atlas Probes

  • Logs show when these devices:
  • Get a new address
  • Reboot
  • Lose connectivity
slide-36
SLIDE 36

Weight address durations

[Figure: fraction of total address-duration vs. IP address-duration (hours); Probe 16893: 60 hours]

       Address           Duration (hours)
IP1    79.194.205.144    NA
IP2    79.194.192.169    24
IP3    79.194.196.241    24
IP4    79.194.194.4      12
IP5    91.9.219.235      NA

Sum: 60
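The weighting for Probe 16893 can be sketched as follows: each address contributes its duration, not a count of one, to the distribution. The sketch treats NA as an unknown duration and leaves it out, an assumption that agrees with the slide's sum of 60 hours. The function name is illustrative.

```python
def weighted_duration_cdf(durations_hours):
    """Return [(duration, cumulative fraction of total address-duration)].

    None stands for an unknown ("NA") duration and is skipped.
    """
    known = sorted(d for d in durations_hours if d is not None)
    total = sum(known)
    if total == 0:
        return []
    cdf = []
    running = 0
    for d in known:
        running += d
        if cdf and cdf[-1][0] == d:
            cdf[-1] = (d, running / total)  # merge equal durations
        else:
            cdf.append((d, running / total))
    return cdf


# Durations from the slide: IP1 and IP5 are NA.
probe_16893 = [None, 24, 24, 12, None]
cdf = weighted_duration_cdf(probe_16893)
print(cdf)
```

With this weighting the two 24-hour addresses account for 48 of the 60 hours, so long-lived addresses dominate the curve even when short-lived ones are more numerous.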

slide-37
SLIDE 37

Addresses often last days

IPv6 addresses last longer

[Figure: fraction of total address-duration vs. IP address-duration (log scale, 1h to 2mo); series: global-connlogs (1359.31), global-ipecho-v4 (1882.53), global-ipecho-v6 (312.41)]

slide-38
SLIDE 38

Periodic address durations are common in Germany and France

[Figure: fraction of total address-duration vs. IP address-duration (log scale, 1h to 2mo); series: US-ipecho-v4 (171.68), DE-ipecho-v4 (339.21), FR-ipecho-v4 (164.51), GB-ipecho-v4 (117.1), RU-ipecho-v4 (113.51), BE-ipecho-v4 (73.5)]

slide-39
SLIDE 39

Cable seems stable.

[Figure: fraction of total address-duration vs. IP address-duration (log scale, 1h to 2mo); series: Deutsche Telekom, AS3320 (151.82); Vodafone DE, AS3209 (21.78); Telefonica DE, AS6805 (15.4); Telefonica DE, AS13184 (14.12); NetCologne, AS8422 (9.12); Liberty Global, AS6830 (60.68); Kabel DE, AS31334 (32.49); Kabel BW, AS29562 (9.06)]

slide-40
SLIDE 40

Could renumbering cause false outages?

  • We don’t see periodic renumbering in the US, so it is unlikely here.
  • Where there is periodic renumbering, we can account for it.

slide-41
SLIDE 41

Two Questions

  • Could high delay create false outages?
  • Could renumbering cause false outages and alter their duration?

slide-42
SLIDE 42

Renumbering by outage duration from Atlas probes for one ISP

[Figure: number of outages (axis to 800) by outage duration: < 5m, 5-10m, 10-20m, 20-30m, 30-60m, 1-3h, 3-6h, 6-12h, 12-24h, 1-3d, 3d-7d, >1w; a second axis runs from 20 to 100]

slide-43
SLIDE 43

Now

  • Building tools to identify hosts after address changes and outages
  • Studying how a sample of address space can be representative
  • Providing information to users about their own and adjacent networks

slide-44
SLIDE 44

Remember

  • When sending a packet into the Internet, you might see a response after minutes.
  • When blacklisting an IP address for misbehavior, you might see the same machine at a different address in a few hours.

slide-45
SLIDE 45

Great Students


slide-46
SLIDE 46

Questions?