Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4

Search Today: Search -> Index -> Crawl
What's Wrong?
Users have a limited search interface.
Today's web is dynamic and growing:
Timely re-crawls are required, but they are not feasible for all web sites.
Decide which sites get crawled:
550 billion documents estimated in 2001 (BrightPlanet); Google indexes 3.3 billion documents.
Decide which sites get re-crawled more frequently: may censor or skew result rankings.
Our solution: organize crawlers using Distributed Hash Tables (DHTs).
A DHT- and Query-Processor-agnostic crawler:
Designed to work over any DHT.
Crawls can be expressed as declarative recursive queries, easy for users to customize.
Queries can be executed over PIER, a DHT-based relational P2P query processor.
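To make the "crawl as a recursive query" idea concrete, here is a minimal Python sketch, not PIER's actual query syntax: new destination URLs are projected out of freshly extracted Link tuples and fed back into the WebPage relation until a fixpoint (or a page budget) is reached. The fetch_links helper is a hypothetical stand-in for the downloader/extractor stage described below.

    # A minimal sketch (not PIER's syntax): the crawl as a recursive
    # query over WebPage(url) and Link(sourceUrl, destUrl), run to fixpoint.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect absolute link targets from one page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url, self.links = base_url, []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def fetch_links(url):
        """Hypothetical Downloader+Extractor: Link tuples for one URL."""
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            return set()
        parser = LinkExtractor(url)
        parser.feed(html)
        return {(url, dest) for dest in parser.links}

    def crawl_fixpoint(seed_urls, max_pages=100):
        webpage = set(seed_urls)      # the WebPage(url) relation
        link = set()                  # the Link(sourceUrl, destUrl) relation
        frontier = set(seed_urls)
        while frontier and len(webpage) < max_pages:
            new_links = set()
            for url in frontier:
                new_links |= fetch_links(url)
            link |= new_links
            # Recursive step: project destUrl, keep only unseen URLs (DupElim).
            frontier = {dest for (_, dest) in new_links} - webpage
            webpage |= frontier
        return webpage, link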
Crawlers: PIER nodes. Crawlees: web servers.
User-defined focused crawlers; collaborative crawling/filtering (special-interest groups).
A bigger, better, faster web crawler. Enables new search and indexing technologies:
P2P web search; web archival and storage (with OceanStore).
Monitor file-sharing networks, e.g. Gnutella.
P2P network maintenance: routing information, OceanStore metadata.
Design considerations:
Balance network load on crawlers, despite DHT communication overheads.
Two components of network load: download bandwidth and DHT bandwidth.
Network proximity: exploit the network locality of crawlers.
Rate-limit requests to crawlees; this prevents denial-of-service on the sites being crawled.
The trade-offs: balance load either on crawlers or on crawlees! Exploit network proximity at the cost of extra communication.
Crawler dataflow (one pipeline per node):
Seed URLs -> DHT Scan: WebPage(url) -> Rate Throttle & Reorder -> CrawlWrapper (Downloader + Redirect handling; input: URLs) -> Extractor (output: links) -> Publish Link(sourceUrl, destUrl) -> Π: Link.destUrl -> Filters & DupElim -> Publish WebPage(url) back into the DHT.
A sketch of the throttle stage follows.
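A minimal sketch of the Rate Throttle & Reorder stage. The politeness policy used here, a fixed per-host minimum interval between downloads, is an assumption, not the policy the slides specify.

    import heapq
    import time
    from urllib.parse import urlparse

    MIN_INTERVAL = 2.0  # assumed per-host delay between downloads (seconds)

    class RateThrottle:
        """Reorder pending URLs so no crawlee host is hit too frequently."""
        def __init__(self):
            self.next_ok = {}   # hostname -> earliest allowed download time
            self.queue = []     # min-heap of (ready_time, url)

        def enqueue(self, url):
            host = urlparse(url).hostname or ""
            ready = max(self.next_ok.get(host, 0.0), time.time())
            self.next_ok[host] = ready + MIN_INTERVAL
            heapq.heappush(self.queue, (ready, url))

        def next_url(self):
            """Pop the earliest-ready URL, sleeping if it is not ready yet."""
            ready, url = heapq.heappop(self.queue)
            wait = ready - time.time()
            if wait > 0:
                time.sleep(wait)
            return url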
Distribution (partition) schemes (see the sketch below):
Partition by URL: hash each URL to a crawler. Ensures even distribution of crawler workload, but incurs high DHT communication traffic.
Partition by hostname: one crawler per hostname. Creates a "control point" for per-server rate throttling, but may lead to uneven crawler load distribution and a single point of failure: a "bad" choice of crawler hurts per-site crawl throughput. Slight variation: X crawlers per hostname.
Redirect: a hybrid hostname/URL partitioning scheme. An overloaded crawler redirects assigned work to another crawler (and so on).
(Figure: grey nodes are the crawlers assigned to www.google.com.)
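A minimal sketch of the two base partitioning schemes over a consistent-hashing DHT, modeled here as a sorted ring of integer node IDs. The names and ring model are illustrative assumptions, not PIER's or any particular DHT's API.

    import hashlib
    from bisect import bisect
    from urllib.parse import urlparse

    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def dht_owner(key, ring):
        """First node clockwise of the key on the sorted ID ring."""
        idx = bisect(ring, _hash(key)) % len(ring)
        return ring[idx]

    def crawler_for_url(url, ring):
        """Partition by URL: even workload, high DHT traffic."""
        return dht_owner(url, ring)

    def crawler_for_host(url, ring, x=1, i=0):
        """Partition by hostname: one control point per web server.

        The X-crawlers-per-hostname variant picks replica i of x.
        """
        host = urlparse(url).hostname or ""
        return dht_owner(f"{host}/{i % x}", ring)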
Deployment:
Crawl workloads:
Workload 1 (whole web): seed URL http://www.google.com, reaching 78,244 different web servers.
Workload 2 (Google only): seed URL http://www.google.com, restricted to the 45 web servers within google.com.
On the Google-only workload, partition by hostname can exploit at most 45 crawlers and shows severe imbalance (70% of crawlers idle). Redirect (hybrid hostname/URL) does best: throughput improves when more crawlers are kept busy.
(Figures: CDF of per-crawler downloads on 80 nodes; crawl throughput scale-up.)
Redirection incurs higher overheads only after the queue size exceeds a threshold (see the sketch below). Hostname incurs low overheads on this workload, since the crawl only touches google.com, which has many self-links.
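A minimal sketch of the redirect-on-overload rule implied here. The queue threshold value, the rehashing trick, and the hop cap are all assumptions for illustration.

    import hashlib

    QUEUE_THRESHOLD = 1000  # assumed; redirection kicks in past this size

    def assigned_crawler(url, nodes, attempt=0):
        """Hash-based assignment; bumping `attempt` rehashes elsewhere."""
        digest = hashlib.sha1(f"{url}#{attempt}".encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    def accept_or_redirect(url, nodes, queue_len, max_hops=3):
        """Follow the redirect chain for at most max_hops reassignments."""
        for attempt in range(max_hops + 1):
            node = assigned_crawler(url, nodes, attempt)
            if queue_len(node) <= QUEUE_THRESHOLD or attempt == max_hops:
                return node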
Per-URL DHT overheads: under Redirect, the per-URL DHT component limits scale-up, which peaks at around 70 nodes.
Network proximity: sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
Partition by hostname approximates random assignment.
Best-of-3 random is "close enough" to best-of-5 random (see the sketch below).
Sanity check: what if a single host crawls all targets?
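A minimal sketch of the best-of-k assignment heuristic these numbers suggest. The ping() argument is a hypothetical RTT probe, abstracted away here.

    import random

    def best_of_k(target, crawlers, ping, k=3):
        """Assign `target` to the closest of k randomly sampled crawlers.

        k=3 was "close enough" to k=5 in the PlanetLab measurements.
        """
        candidates = random.sample(crawlers, k)
        return min(candidates, key=lambda c: ping(c, target))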
Summary: the schemes (Hostname, Redirect, URL) are compared on load-balancing download bandwidth, DHT communication, network proximity, rate-limiting crawlees, and load-balancing DHT bandwidth.
Partition by URL: batching with ring-based forwarding. Experimented on 4 local machines.
Partition by hostname: forwards the crawl to the DHT neighbor closest to the crawlee (see the sketch below). Experimented on 12 local machines.
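A minimal sketch of the proximity-forwarding idea: hand the download to whichever known node is closest to the crawlee. The rtt() argument is a hypothetical node-to-crawlee latency estimate; the real system's mechanism is not specified here.

    def forward_crawl(url, self_node, neighbors, rtt):
        """Return the node that should download `url`.

        `neighbors` are this node's DHT routing-table entries; crawl
        locally when self_node is already the closest to the crawlee.
        """
        candidates = [self_node] + list(neighbors)
        return min(candidates, key=lambda node: rtt(node, url))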
Conclusions:
Propose a DHT- and QP-agnostic distributed web crawler.
Express a crawl as a query; permits user-customizable refinement of crawls.
Discover important trade-offs in distributed crawling: co-ordination comes with extra communication costs.
Deployment and experimentation on PlanetLab: examine crawl distribution strategies under different workloads on live web sources, and measure the potential benefits of network proximity.
Related work:
Google: a centralized dispatcher sends URLs to crawlers.
Hash-based parallel crawlers.
BINGO!: crawls the web given a basic training set.
Grub: built on SETI@Home infrastructure; 23,993 members.
Partition by hostname shows imbalance: some crawlers are over-utilized for downloads. Little difference in throughput as long as download threads are kept busy.
For load balance, URL is best, followed by Redirect and Hostname.
Essential for overlapping users.
A requirement of personalized crawls and of online relevance feedback.