Web Content Cartography

SLIDE 1

Web Content Cartography

Bernhard Ager†, Wolfgang Mühlbauer‡, Georgios Smaragdakis†, Steve Uhlig†

†Technische Universität Berlin / T-Labs

‡ETH Zürich

Internet Measurement Conference 2011

Ager, Mühlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH), Web Content Cartography, IMC’11

SLIDE 2

Motivation

Motivation

Content is King

  • Web traffic currently dominates: ∼ 60 %
  • Hosting infrastructures are the workhorse of content delivery
  • But: “The only constant is change”: Hyper-giants, Meta CDNs, IETF CDNi, virtualization, applications

SLIDE 3

Motivation

How is the hosting landscape evolving?

We need to characterize hosting infrastructures

  • Researchers: Understand the content eco-system better
  • Content providers: Discover choice of available infrastructures
  • ISPs: Make strategic decisions: peering, CDN infrastructure
  • Infrastructures: Understand position in the market

SLIDE 5

Motivation

How we complement existing work

Earlier approaches to characterize infrastructures

Hyper-giants, Google [La10]; Hosting models [Le09]; Rapidshare [An09], Akamai and Limelight [Hu08]; Akamai [Su06]; Akamai, Digital Island, and 12 more [Kr01]; ...

... and how our approach is different

  • No a-priori signatures
  • Aiming at the broad picture
  • Automatable, lightweight

[La10] C. Labovitz, S. Iekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet Inter-Domain Traffic. In Proc. ACM SIGCOMM, 2010.
[Le09] T. Leighton. Improving Performance on the Internet. Commun. ACM, 2009.
[An09] D. Antoniades, E. Markatos, and C. Dovrolis. One-click Hosting Services: A File-Sharing Hideout. In Proc. ACM IMC, 2009.
[Hu08] C. Huang, A. Wang, J. Li, and K. Ross. Measuring and Evaluating Large-scale CDNs. In Proc. ACM IMC, 2008.
[Su06] A. Su, D. Choffnes, A. Kuzmanovic, and F. Bustamante. Drafting Behind Akamai: Inferring Network Conditions Based on CDN Redirections. IEEE/ACM Trans. Netw., 2009.
[Kr01] B. Krishnamurthy, C. Wills, and Y. Zhang. On the Use and Performance of Content Distribution Networks. In Proc. ACM IMW, 2001.

SLIDE 6

Outline

  1. Motivation
  2. Approach
  3. Data
  4. Results
  5. Conclusion

SLIDE 7

Approach

What are the characteristics of content hosting?

Web content cartography

  • What are those hosting infrastructures?
  • Where are they located?
      • At the network level
      • Geographically
  • Who is operating them?
  • Which role does each infrastructure play?

We propose web content cartography: building maps of hosting infrastructures

SLIDE 8

Approach

A sketch of HTTP content delivery

Observation: DNS exposes the network footprint
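
To illustrate the observation, here is a minimal sketch of how DNS answers could be turned into a footprint of /24 subnetworks. The function names (`resolve_ipv4`, `to_slash24`, `footprint`) are hypothetical and not taken from the talk; the sketch uses the local resolver through Python's standard library, which is only one of several possible ways to collect answers.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname.

    Needs network access; the answer depends on where the query is issued.
    """
    infos = socket.getaddrinfo(hostname, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def to_slash24(ip):
    """Map an IPv4 address to its enclosing /24 subnetwork."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def footprint(addresses):
    """Aggregate a collection of IPv4 addresses into the set of /24s seen."""
    return {to_slash24(ip) for ip in addresses}
```

Calling `footprint(resolve_ipv4(name))` from several vantage points and taking the union of the results gives a rough picture of where a hostname is served from.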

SLIDE 9

Approach

Identifying infrastructures

Features

  • IP address, /24
  • Prefix, AS

Two-level clustering process

  • First phase: k-means
  • Second phase: based on address space
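
The two phases can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the feature vector (number of IPs, /24s, and ASes per hostname), the deterministic k-means initialisation, the Dice-style overlap measure, and the 0.7 threshold are all chosen here for the example, and every name in the code is hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centers start at the first k distinct points."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def overlap(a, b):
    """Dice-style overlap of two non-empty prefix sets (assumed measure)."""
    return 2 * len(a & b) / (len(a) + len(b))


def split_by_address_space(hostnames, prefixes, threshold=0.7):
    """Second phase: group hostnames whose prefix sets overlap strongly."""
    clusters = []
    for host in hostnames:
        for cluster in clusters:
            if overlap(cluster["prefixes"], prefixes[host]) >= threshold:
                cluster["hosts"].append(host)
                cluster["prefixes"] |= prefixes[host]
                break
        else:
            clusters.append({"prefixes": set(prefixes[host]), "hosts": [host]})
    return [cluster["hosts"] for cluster in clusters]


def two_level_clustering(features, prefixes, k):
    """First k-means on the features, then a split by address space."""
    names = sorted(features)
    labels = kmeans([features[name] for name in names], k)
    result = []
    for label in sorted(set(labels)):
        group = [n for n, lab in zip(names, labels) if lab == label]
        result.extend(split_by_address_space(group, prefixes))
    return result


# Toy input: (number of IPs, number of /24s, number of ASes) per hostname.
features = {"cdn-a.example": (100, 40, 20), "cdn-b.example": (90, 38, 19),
            "single-c.example": (1, 1, 1), "single-d.example": (1, 1, 1)}
prefixes = {"cdn-a.example": {"p1", "p2", "p3"}, "cdn-b.example": {"p1", "p2", "p3"},
            "single-c.example": {"p8"}, "single-d.example": {"p9"}}
clusters = two_level_clustering(features, prefixes, k=2)
```

In the toy input, k-means separates the two widely deployed hostnames from the two single-location ones, and the second phase keeps the latter apart because their prefixes do not overlap.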

SLIDE 10

Data

Collecting data

Hostnames

Requirement: Good coverage of hosting infrastructures

  • Extracted from the Alexa top 1 million list
  • 2000 top, 2000 tail, ∼ 3000 embedded, ∼ 850 CNAMEs

Traces

Requirement: Sampling a large enough network footprint

  • Script
  • Run by volunteers
  • Trace collection via website

Traces      133
ASNs        78
Countries   27
Continents  6

SLIDE 11

Data

Estimating coverage

How should you choose vantage points?

[Figure: number of /24 subnetworks discovered (y-axis) versus number of traces (x-axis); curves: Optimized, Max random, Median random, Min random]

Insights

  • Optimized: first 30 traces from 30 ASs in 24 countries ⇒ sampling diversity comes from geographic and network diversity
  • Median: tail traces yield 20 /24s per trace ⇒ limited utility when adding more traces
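
One way to obtain an ordering like the "Optimized" curve is a greedy selection that always picks the trace contributing the most not-yet-seen /24s. The sketch below assumes that reading of "Optimized"; the slide does not spell out the procedure, and the function name and input layout are hypothetical.

```python
def greedy_trace_order(traces):
    """Order traces so that each step adds the most not-yet-seen /24s.

    traces: dict mapping a trace id to the set of /24s it discovered.
    Returns a list of (trace_id, cumulative number of /24s) pairs.
    """
    remaining = dict(traces)
    seen = set()
    order = []
    while remaining:
        # Ties are broken by trace id so the result is deterministic.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - seen))
        seen |= remaining.pop(best)
        order.append((best, len(seen)))
    return order


example = {"t1": {"a", "b", "c"}, "t2": {"c", "d"}, "t3": {"a", "b"}}
ordering = greedy_trace_order(example)
```

In the toy example, `t3` comes last because everything it saw was already covered, which mirrors the slide's point that late traces add little.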

SLIDE 12

Data

Estimating coverage

How should you choose hostnames?

[Figure: CDF of similarity (both axes 0.0 to 1.0); curves: Embedded, Top 2000, Total, Tail 2000]

Insights

  • embedded: similarity low ⇒ better distributed
  • tail: similarity high ⇒ mostly centralized
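
The slide does not give the similarity definition, so the following is only a sketch under assumptions: it uses a Dice-style set similarity between the /24 sets that different traces return for one hostname, averages it over all trace pairs, and builds an empirical CDF from the per-hostname values. All function names are hypothetical.

```python
from itertools import combinations


def similarity(s1, s2):
    """Dice-style similarity of two sets: 1.0 if identical, 0.0 if disjoint."""
    if not s1 and not s2:
        return 1.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))


def hostname_similarity(answers_per_trace):
    """Mean pairwise similarity of the /24 sets seen for one hostname.

    answers_per_trace: list with one set of /24s per trace (at least two).
    """
    pairs = list(combinations(answers_per_trace, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def empirical_cdf(values):
    """Return the step points (value, rank / n) of the empirical CDF."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


centralized = [{"x"}, {"x"}, {"x"}]        # same answer from every trace
distributed = [{"a"}, {"b"}, {"a", "c"}]   # answers depend on the vantage point
```

A hostname that resolves to the same /24 everywhere scores 1.0, while one whose answers vary by vantage point scores low, which matches the slide's reading of high similarity as centralized hosting.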

SLIDE 13

Results

Characterizing infrastructures

Rank  # hostnames  Owner
1     476          Akamai
3     108          Google
4     70           Akamai
5     70           Google
6     57           Limelight
7     57           ThePlanet
12    28           Wordpress

Content mix legend (per-row bars not recoverable): only on top; both on top and embedded; only on embedded; tail.

Main findings in Top 20

  • tail content is important: consolidation
  • Some companies run multiple infrastructures
  • embedded often dominating

SLIDE 16

Results

Content potential and monopoly

Location  CP    NCP   CMI
AS 1      1     0.75  0.75
AS 2      0.5   0.25  0.5

Content Potential (CP): Fraction of content available from a location.
Normalized Content Potential (NCP): CP weighted by distributedness.
Content Monopoly Index (CMI): CMI = NCP / CP
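
The numbers in the table can be reproduced with a small sketch. It assumes that "weighted by distributedness" means each hostname's weight is split evenly across the locations serving it, and it uses a made-up two-hostname example (one hostname served only from AS 1, one served from both ASs), since the slide's original figure is not available. The function name is hypothetical.

```python
from collections import defaultdict


def content_metrics(hosting):
    """Compute CP, NCP and CMI per location.

    hosting: dict mapping a hostname to the set of locations serving it.
    """
    total = len(hosting)
    cp = defaultdict(float)
    ncp = defaultdict(float)
    for locations in hosting.values():
        for loc in locations:
            cp[loc] += 1 / total                      # hostname counts fully
            ncp[loc] += 1 / (total * len(locations))  # weight shared by locations
    return {loc: {"CP": cp[loc], "NCP": ncp[loc], "CMI": ncp[loc] / cp[loc]}
            for loc in cp}


toy = {"only-in-as1.example": {"AS 1"}, "shared.example": {"AS 1", "AS 2"}}
metrics = content_metrics(toy)
```

For this toy input, AS 1 gets CP 1, NCP 0.75, CMI 0.75 and AS 2 gets CP 0.5, NCP 0.25, CMI 0.5, the same values as in the table.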

SLIDE 17

Results

Normalized content potential: Top 12 ASs

[Figure: CP and NCP by rank for the top 12 ASs; y-axis: potential, 0.00 to 0.15]

Rank  AS name        CMI
1     Chinanet       0.699
2     Google         0.996
3     ThePlanet.com  0.985
4     SoftLayer      0.967
5     China169 BB    0.576
6     Level 3        0.109
7     China Telecom  0.470
8     Rackspace      0.954
9     1&1 Internet   0.969
10    OVH            0.969
11    NTT America    0.070
12    EdgeCast       0.688

SLIDE 18

Results

Comparing AS rankings

CAIDA-cone [CAIDA]

  • Number of customer ASs

Arbor [La10]

  • Inter-AS traffic volume

Normalized potential

  • Weighted content availability

Rank  CAIDA-cone          Arbor            Normalized potential
1     Level 3             Level 3          Chinanet
2     AT&T                Global Crossing  Google
3     MCI                 Google           ThePlanet
4     Cogent/PSI          *                SoftLayer
5     Global Crossing     *                China169 backbone
6     Sprint              Comcast          Level 3
7     Qwest               *                Rackspace
8     Hurricane Electric  *                China Telecom
9     tw telecom          *                1&1 Internet
10    TeliaNet            *                OVH

[La10] C. Labovitz, S. Iekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet Inter-Domain Traffic. In Proc. ACM SIGCOMM, 2010.
[CAIDA] http://as-rank.caida.org/

SLIDE 19

Conclusion

Conclusion

Summary

  • Lightweight discovery of hosting infrastructures
  • Characterization of hosting infrastructures
  • We can detect the inhomogeneous use of infrastructures
  • Content-centric AS rankings
  • “Content monopolies”: Google, Chinese ISPs
  • Complementary to traditional rankings

Future work

  • Relate with other metrics: traffic volume, finances, ...
  • Explore the interplay of content delivery with the topology
  • Break down content by other categories: language, category, ...
  • Follow-up work: increase coverage

SLIDE 20

Appendix

Backup slides

SLIDE 21

Appendix Background

Top 20 content clusters by hostname count

Rank  # hostnames  # ASes  # prefixes  Owner
1     476          79      294         Akamai
2     161          70      216         Akamai
3     108          1       45          Google
4     70           35      137         Akamai
5     70           1       45          Google
6     57           6       15          Limelight
7     57           1       1           ThePlanet
8     53           1       1           ThePlanet
9     49           34      123         Akamai
10    34           1       2           Skyrock
11    29           6       17          Cotendo
12    28           4       5           Wordpress
13    27           6       21          Footprint
14    26           1       1           Ravand
15    23           1       1           Xanga
16    22           1       4           Edgecast
17    22           1       1           ThePlanet
18    21           1       1           ivwbox.de
19    21           1       5           AOL
20    20           1       1           Leaseweb

Content mix legend (per-row bars not recoverable): only on top; both on top and embedded; only on embedded; tail.

SLIDE 22

Appendix Background

Marginal utility: hostnames

[Figure: number of /24 subnetworks discovered (y-axis) versus hostnames ordered by utility (x-axis); curves: Total, Embedded, Top 2000, Tail 2000]

SLIDE 23

Appendix Background

Content exchange matrix: top

                Served from
Requested from  Africa  Asia  Europe  N. America  Oceania  S. America
Africa          0.3     18.6  32.0    46.7        0.3      0.8
Asia            0.3     26.0  20.7    49.8        0.3      0.8
Europe          0.3     18.6  32.2    46.6        0.2      0.8
N. America      0.3     18.6  20.7    58.2        0.2      0.8
Oceania         0.3     20.8  20.5    49.2        5.9      0.8
S. America      0.2     18.7  20.6    49.3        0.2      10.1
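
A matrix of this shape could be derived from per-request observations roughly as sketched below. The slide does not state how requests are weighted or whether the entries are percentages, so the row normalisation to percent, the input layout, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def content_matrix(observations):
    """Build a row-normalised content exchange matrix.

    observations: iterable of (requested_from, served_from) continent pairs.
    Returns a dict: requesting continent -> {serving continent: percent}.
    """
    counts = defaultdict(Counter)
    for requested_from, served_from in observations:
        counts[requested_from][served_from] += 1
    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        matrix[src] = {dst: 100 * n / total for dst, n in row.items()}
    return matrix


sample = [("Europe", "Europe"), ("Europe", "N. America"),
          ("Europe", "N. America"), ("Europe", "Asia")]
matrix = content_matrix(sample)
```

Each row then describes, for one requesting continent, the share of content served from every continent.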

SLIDE 24

Appendix Background

Content exchange matrix: embedded

                Served from
Requested from  Africa  Asia  Europe  N. America  Oceania  S. America
Africa          0.3     26.9  35.5    35.8        0.3      0.6
Asia            0.3     37.9  18.3    40.1        1.1      0.6
Europe          0.3     26.8  35.6    35.6        0.4      0.6
N. America      0.3     26.5  18.4    52.9        0.3      0.6
Oceania         0.3     29.2  18.5    38.7        11.3     0.6
S. America      0.3     26.4  18.2    39.3        0.3      14.2

SLIDE 25

Appendix Background

Sizes of similarity clusters

[Figure: number of hostnames on infrastructure (y-axis) versus infrastructure cluster by rank (x-axis); log-scale axes]

SLIDE 26

Appendix Background

Hosting model of the less distributed infrastructures

[Figure: fraction (0.0 to 1.0) of infrastructure clusters by number of countries (1 to 13), grouped by number of ASNs per infrastructure. Number of infrastructure clusters per group: 1 ASN (2620), 2 (289), 3 (61), 4 (26), 5 (9), 6 (11), 7 (6), 8 (2), 9 (3), 10 (2)]

SLIDE 28

Appendix Background

Determining the hosting model

An example: Skyrock vs. Cotendo

Rank  # hostnames  # ASes  # prefixes  Owner
10    34           1       2           Skyrock
11    29           6       17          Cotendo

Content mix legend (per-row bars not recoverable): only on top; both on top and embedded; only on embedded; tail.

Skyrock

  • 4 /24-subnetworks
  • Website offering blogs/OSN
  • Single country: France

⇒ Data center

Cotendo

  • 21 /24-subnetworks
  • Website offers CDN service
  • 8 countries on 4 continents

⇒ Global scale CDN
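
The reasoning in this example can be written down as a coarse rule of thumb. The thresholds and the function name below are illustrative assumptions only; the slide infers the hosting model by inspection and does not define cut-offs.

```python
def hosting_model(num_ases, num_countries, num_continents):
    """Very coarse label from footprint size; thresholds are illustrative."""
    if num_ases == 1 and num_countries == 1:
        return "data center"
    if num_continents >= 3:
        return "global-scale CDN"
    return "undetermined"


skyrock = hosting_model(num_ases=1, num_countries=1, num_continents=1)
cotendo = hosting_model(num_ases=6, num_countries=8, num_continents=4)
```

With the figures from the table and bullet lists, Skyrock falls into the single-location case and Cotendo into the multi-continent case.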
