SLIDE 1

Distributed Web Crawling over DHTs

Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4

SLIDE 2

Search Today

[Diagram: users issue Search queries against an Index, which is populated by Crawl processes.]

SLIDE 3

What’s Wrong?

Users have a limited search interface.

Today's web is dynamic and growing:

  • 550 billion documents estimated in 2001 (BrightPlanet); Google indexes 3.3 billion documents.
  • Timely re-crawls are required, but are not feasible for all web sites.

Search engines control your search results:

  • Decide which sites get crawled.
  • Decide which sites get updated more frequently.
  • May censor or skew result rankings.

Challenge: user-customizable searches that scale.

SLIDE 4

Our Solution: A Distributed Crawler

P2P users donate excess bandwidth and computation resources to crawl the web.

Organized using distributed hash tables (DHTs). A DHT- and query-processor-agnostic crawler:

  • Designed to work over any DHT.
  • Crawls can be expressed as declarative recursive queries, making user customization easy.
  • Queries can be executed over PIER, a DHT-based relational P2P query processor.

Crawlers: PIER nodes. Crawlees: web servers.

SLIDE 5

Potential

Infrastructure for crawl personalization:

  • User-defined focused crawlers.
  • Collaborative crawling/filtering (special-interest groups).

Other possibilities:

  • A bigger, better, faster web crawler.
  • Enables new search and indexing technologies: P2P web search; web archival and storage (with OceanStore).
  • A generalized crawler for querying distributed graph structures: monitoring file-sharing networks such as Gnutella; P2P network maintenance (routing information, OceanStore metadata).

SLIDE 6

Challenges that We Investigated

Scalability and throughput:

  • DHT communication overheads.
  • Balance network load on crawlers: two components of network load, download bandwidth and DHT bandwidth.

Network proximity:

  • Exploit the network locality of crawlers.

Limit download rates on web sites:

  • Prevents denial-of-service attacks on crawled servers.

Main tradeoff: tension between coordination and communication.

  • Balance load either on crawlers or on crawlees!
  • Exploit network proximity at the cost of communication.
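As an illustration of the rate-limiting point, per-site download limits can be enforced with a simple token bucket at whichever node controls a host's URLs. A minimal sketch, assuming a 1-request-per-second rate and a small burst (values not taken from the slides):

    import time

    class HostThrottle:
        """Token bucket limiting the download rate for one web server."""
        def __init__(self, rate=1.0, burst=2.0):
            self.rate, self.burst = rate, burst        # tokens/sec, max tokens
            self.tokens, self.last = burst, time.monotonic()

        def acquire(self):
            """Block until one download 'token' is available."""
            now = time.monotonic()
            # refill tokens accrued since the last call, capped at burst
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1.0:                      # out of budget: wait
                time.sleep((1.0 - self.tokens) / self.rate)
                self.tokens = 1.0
            self.tokens -= 1.0

One HostThrottle instance per hostname at its controlling crawler gives each web server an enforced request ceiling, independent of how many crawl threads are active.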

SLIDE 7

Crawl as a Recursive Query

[Dataflow diagram] Seed URLs are published into the WebPage(url) table. A DHT scan over WebPage(url) feeds a Rate Throttle & Reorder stage; throttled URLs enter the crawler thread's CrawlWrapper (Downloader, Extractor, and Redirect handling), which takes URLs as input and outputs links. The wrapper publishes Link(sourceUrl, destUrl) tuples; the projection Π: Link.destUrl, followed by duplicate elimination (DupElim) and filters, publishes new WebPage(url) tuples, closing the recursive loop.
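A minimal single-process sketch of this recursive dataflow. The dht object with publish/scan, and the download and extract_links callables, are hypothetical stand-ins for the DHT operators and the CrawlWrapper internals; the real system runs this as a PIER query plan, not a Python loop:

    import time
    from urllib.parse import urlparse

    def crawl(dht, seed_urls, download, extract_links, per_host_delay=1.0):
        for url in seed_urls:
            dht.publish("WebPage", url)            # seed the recursion
        seen = set(seed_urls)                       # DupElim state
        last = {}                                   # per-host throttle state
        # assume dht.scan yields WebPage(url) tuples as they are published
        for url in dht.scan("WebPage"):
            host = urlparse(url).netloc
            wait = per_host_delay - (time.time() - last.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)                    # Rate Throttle & Reorder
            last[host] = time.time()
            page = download(url)                    # CrawlWrapper: Downloader
            for dest in extract_links(page):        # CrawlWrapper: Extractor
                dht.publish("Link", (url, dest))    # Link(sourceUrl, destUrl)
                if dest not in seen:                # Π destUrl → DupElim → Filters
                    seen.add(dest)
                    dht.publish("WebPage", dest)    # close the recursive loop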

SLIDE 8

Crawl Distribution Strategies

Partition by URL:

  • Ensures even distribution of crawler workload.
  • High DHT communication traffic.

Partition by Hostname:

  • One crawler per hostname.
  • Creates a "control point" for per-server rate throttling.
  • May lead to uneven crawler load distribution.
  • Single point of failure: a "bad" choice of crawler affects per-site crawl throughput.
  • Slight variation: X crawlers per hostname.
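A sketch of how the two partitioning rules, plus the X-crawlers-per-hostname variation, might map a URL to its responsible node. The dht.lookup(key) call, returning the node that owns a hash key, is an assumed interface:

    import hashlib
    from urllib.parse import urlparse

    def owner_node(dht, url, scheme="url", crawlers_per_host=1):
        """Map a URL to the DHT node responsible for crawling it."""
        if scheme == "url":
            key = url                              # partition by URL
        else:
            host = urlparse(url).netloc            # partition by hostname
            # variation: spread one host's URLs over X crawlers
            bucket = int(hashlib.sha1(url.encode()).hexdigest(),
                         16) % crawlers_per_host
            key = f"{host}/{bucket}"
        return dht.lookup(hashlib.sha1(key.encode()).hexdigest())

Hashing the full URL spreads work (and DHT traffic) uniformly; hashing only the hostname concentrates each site at one node, which is exactly what makes per-server throttling easy and load balance hard.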

SLIDE 9

Redirection

  • A simple technique that allows a crawler to redirect, i.e. pass on, its assigned work to another crawler (and so on…).
  • A second-chance distribution mechanism, orthogonal to the partitioning scheme.
  • Example, partition by hostname: the node responsible for www.google.com dispatches its work, by URL, to other nodes.
  • Load-balancing benefits of partition by URL.
  • Control benefits of partition by hostname.
  • When? Policy-based: crawler load (queue size); network proximity.
  • Why not? Cost of redirection: increased DHT control traffic. Hence, put a limit on the number of redirections per URL.
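A sketch of a policy-based redirection decision, assuming hypothetical queue_size(), neighbors(), forward(), and crawl() methods on the local node:

    MAX_REDIRECTS = 1    # cap redirections per URL to bound DHT control traffic

    def dispatch(node, url, redirects=0, overload=100):
        """Second-chance load balancing layered on the partitioning scheme."""
        if redirects >= MAX_REDIRECTS or node.queue_size() <= overload:
            return node.crawl(url)                 # crawl locally
        # policy shown: pick the least-loaded known peer; network proximity
        # could be used instead of (or alongside) queue size
        target = min(node.neighbors(), key=lambda n: n.queue_size())
        return node.forward(url, target, redirects + 1)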

SLIDE 10

Experiments

Deployment:

  • WebCrawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  • 3 crawl threads per crawler; 15-minute crawl duration.

Distribution (partition) schemes:

  • URL.
  • Hostname.
  • Hostname with 8 crawlers per unique host.
  • Hostname with one level of redirection on overload.

Crawl workloads:

  • Exhaustive crawl: seed URL http://www.google.com; 78,244 different web servers.
  • Crawl of a fixed number of sites: seed URL http://www.google.com; 45 web servers within google.com.
  • Crawl of a single site: http://groups.google.com.
SLIDE 11

Crawl of Multiple Sites I

Hostname can exploit at most 45 crawlers (one per site). Redirect, the hybrid hostname/URL scheme, does best. Partition by hostname shows poor load balance (70% of crawlers idle). Throughput is better when more crawlers are kept busy.

[Figures: CDF of per-crawler downloads (80 nodes); crawl throughput scale-up.]

SLIDE 12

Crawl of Multiple Sites II

Redirection incurs higher overheads only after the queue size exceeds a threshold. Hostname incurs low overheads, since the crawl only looks at google.com, which has many self-links. With Redirect, the per-URL DHT overheads hit their maximum at around 70 nodes.

[Figure: per-URL DHT overheads.]

SLIDE 13

Network Proximity

Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts. Partition by hostname approximates random assignment. Best-3 random is "close enough" to best-5 random; a sketch of this selection rule follows below. Sanity check: what if a single host crawls all targets?
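A sketch of the best-of-k random assignment compared here, with ping passed in as a hypothetical RTT probe from a candidate crawler to the target:

    import random

    def best_of_k(crawlers, target, ping, k=3):
        """Assign a crawl target to the closest of k random crawler candidates."""
        candidates = random.sample(crawlers, k)    # random subset of crawlers
        return min(candidates, key=lambda c: ping(c, target))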

SLIDE 14

Summary of Schemes

                                     URL   Hostname   Redirect
  Load-balance download bandwidth     +       -          +
  DHT communication overheads         -       +          ?
  Network proximity                   -       -          +
  Rate-limit crawlees                 -       +          +
  Load-balance DHT bandwidth          +       -          ?

SLIDE 15

Related Work

Herodotus, at MIT (Chord-based):

  • Partition by URL.
  • Batching with ring-based forwarding.
  • Experimented on 4 local machines.

Apoidea, at GaTech (Chord-based):

  • Partition by hostname.
  • Forwards crawl work to the DHT neighbor closest to the website.
  • Experimented on 12 local machines.

SLIDE 16

Conclusion

Our main contributions:

  • Propose a DHT- and QP-agnostic distributed crawler.
    • Express the crawl as a query, permitting user-customizable refinement of crawls.
  • Discover important trade-offs in distributed crawling:
    • Coordination comes with extra communication costs.
  • Deployment and experimentation on PlanetLab:
    • Examine crawl distribution strategies under different workloads on live web sources.
    • Measure the potential benefits of network proximity.

SLIDE 17

Backup slides

SLIDE 18

Existing Crawlers

Cluster-based crawlers:

  • Google: a centralized dispatcher sends URLs to be crawled.
  • Hash-based parallel crawlers.

Focused crawlers:

  • BINGO!: crawls the web given a basic training set.

Peer-to-peer:

  • Grub: SETI@Home-style infrastructure; 23,993 members.

SLIDE 19

Exhaustive Crawl

Partition by hostname shows imbalance: some crawlers are over-utilized for downloads. Little difference in throughput; most crawler threads are kept busy.

SLIDE 20

Single Site

URL is best, followed by redirect and hostname.

SLIDE 21

Future Work

  • Fault tolerance.
  • Security.
  • Single-node throughput.
  • Work-sharing between crawl queries: essential for overlapping users.
  • Global crawl prioritization: a requirement of personalized crawls; online relevance feedback.
  • Deep-web retrieval.