Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4

Search Today: Search -> Index -> Crawl
What's Wrong?
Users have a limited search interface.
Today's web is dynamic and growing:
Timely re-crawls are required, but they are not feasible for all web sites.
Decide which sites get crawled:
550 billion documents estimated in 2001 (BrightPlanet); Google indexes 3.3 billion documents.
Decide which sites get re-crawled more frequently: may censor or skew result rankings.
Our solution: organize crawlers using Distributed Hash Tables (DHTs).
A DHT- and Query-Processor-agnostic crawler:
Designed to work over any DHT.
Crawls can be expressed as declarative recursive queries, easy for users to customize.
Queries can be executed over PIER, a DHT-based relational P2P query processor.
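To make the "crawl as a recursive query" idea concrete, here is a minimal Python sketch, not PIER's actual query syntax: new destination URLs are projected out of freshly extracted Link tuples and fed back into the WebPage relation until a fixpoint (or a page budget) is reached. The fetch_links helper is a hypothetical stand-in for the downloader/extractor stage described below.

    # A minimal sketch (not PIER's syntax): the crawl as a recursive
    # query over WebPage(url) and Link(sourceUrl, destUrl), run to fixpoint.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect absolute link targets from one page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url, self.links = base_url, []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def fetch_links(url):
        """Hypothetical Downloader+Extractor: Link tuples for one URL."""
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            return set()
        parser = LinkExtractor(url)
        parser.feed(html)
        return {(url, dest) for dest in parser.links}

    def crawl_fixpoint(seed_urls, max_pages=100):
        webpage = set(seed_urls)      # the WebPage(url) relation
        link = set()                  # the Link(sourceUrl, destUrl) relation
        frontier = set(seed_urls)
        while frontier and len(webpage) < max_pages:
            new_links = set()
            for url in frontier:
                new_links |= fetch_links(url)
            link |= new_links
            # Recursive step: project destUrl, keep only unseen URLs (DupElim).
            frontier = {dest for (_, dest) in new_links} - webpage
            webpage |= frontier
        return webpage, link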
Crawlers: PIER nodes. Crawlees: web servers.
User-defined focused crawlers; collaborative crawling/filtering (special-interest groups).
A bigger, better, faster web crawler. Enables new search and indexing technologies:
P2P web search; web archival and storage (with OceanStore).
Monitor file-sharing networks, e.g. Gnutella.
P2P network maintenance: routing information, OceanStore metadata.
Design considerations:
Balance network load on crawlers, despite DHT communication overheads.
Two components of network load: download bandwidth and DHT bandwidth.
Network proximity: exploit the network locality of crawlers.
Rate-limit requests to crawlees; this prevents denial-of-service on the sites being crawled.
The trade-offs: balance load either on crawlers or on crawlees! Exploit network proximity at the cost of extra communication.
Crawler dataflow (one pipeline per node):
Seed URLs -> DHT Scan: WebPage(url) -> Rate Throttle & Reorder -> CrawlWrapper (Downloader + Redirect handling; input: URLs) -> Extractor (output: links) -> Publish Link(sourceUrl, destUrl) -> Π: Link.destUrl -> Filters & DupElim -> Publish WebPage(url) back into the DHT.
A sketch of the throttle stage follows.
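A minimal sketch of the Rate Throttle & Reorder stage. The politeness policy used here, a fixed per-host minimum interval between downloads, is an assumption, not the policy the slides specify.

    import heapq
    import time
    from urllib.parse import urlparse

    MIN_INTERVAL = 2.0  # assumed per-host delay between downloads (seconds)

    class RateThrottle:
        """Reorder pending URLs so no crawlee host is hit too frequently."""
        def __init__(self):
            self.next_ok = {}   # hostname -> earliest allowed download time
            self.queue = []     # min-heap of (ready_time, url)

        def enqueue(self, url):
            host = urlparse(url).hostname or ""
            ready = max(self.next_ok.get(host, 0.0), time.time())
            self.next_ok[host] = ready + MIN_INTERVAL
            heapq.heappush(self.queue, (ready, url))

        def next_url(self):
            """Pop the earliest-ready URL, sleeping if it is not ready yet."""
            ready, url = heapq.heappop(self.queue)
            wait = ready - time.time()
            if wait > 0:
                time.sleep(wait)
            return url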
Distribution (partition) schemes (see the sketch below):
Partition by URL: hash each URL to a crawler. Ensures even distribution of crawler workload, but incurs high DHT communication traffic.
Partition by hostname: one crawler per hostname. Creates a "control point" for per-server rate throttling, but may lead to uneven crawler load distribution and a single point of failure: a "bad" choice of crawler hurts per-site crawl throughput. Slight variation: X crawlers per hostname.
Redirect: a hybrid hostname/URL partitioning scheme. An overloaded crawler redirects assigned work to another crawler (and so on).
(Figure: grey nodes are the crawlers assigned to www.google.com.)
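A minimal sketch of the two base partitioning schemes over a consistent-hashing DHT, modeled here as a sorted ring of integer node IDs. The names and ring model are illustrative assumptions, not PIER's or any particular DHT's API.

    import hashlib
    from bisect import bisect
    from urllib.parse import urlparse

    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def dht_owner(key, ring):
        """First node clockwise of the key on the sorted ID ring."""
        idx = bisect(ring, _hash(key)) % len(ring)
        return ring[idx]

    def crawler_for_url(url, ring):
        """Partition by URL: even workload, high DHT traffic."""
        return dht_owner(url, ring)

    def crawler_for_host(url, ring, x=1, i=0):
        """Partition by hostname: one control point per web server.

        The X-crawlers-per-hostname variant picks replica i of x.
        """
        host = urlparse(url).hostname or ""
        return dht_owner(f"{host}/{i % x}", ring)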
Deployment:
Crawl workloads:
Workload 1 (whole web): seed URL http://www.google.com, reaching 78,244 different web servers.
Workload 2 (Google only): seed URL http://www.google.com, restricted to the 45 web servers within google.com.
On the Google-only workload, partition by hostname can exploit at most 45 crawlers and shows severe imbalance (70% of crawlers idle). Redirect (hybrid hostname/URL) does best: throughput improves when more crawlers are kept busy.
(Figures: CDF of per-crawler downloads on 80 nodes; crawl throughput scale-up.)
Redirection incurs higher overheads only after the queue size exceeds a threshold (see the sketch below). Hostname incurs low overheads on this workload, since the crawl only touches google.com, which has many self-links.
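A minimal sketch of the redirect-on-overload rule implied here. The queue threshold value, the rehashing trick, and the hop cap are all assumptions for illustration.

    import hashlib

    QUEUE_THRESHOLD = 1000  # assumed; redirection kicks in past this size

    def assigned_crawler(url, nodes, attempt=0):
        """Hash-based assignment; bumping `attempt` rehashes elsewhere."""
        digest = hashlib.sha1(f"{url}#{attempt}".encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    def accept_or_redirect(url, nodes, queue_len, max_hops=3):
        """Follow the redirect chain for at most max_hops reassignments."""
        for attempt in range(max_hops + 1):
            node = assigned_crawler(url, nodes, attempt)
            if queue_len(node) <= QUEUE_THRESHOLD or attempt == max_hops:
                return node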
Per-URL DHT overheads: under Redirect, the per-URL DHT component limits scale-up, which peaks at around 70 nodes.
Network proximity: sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
Partition by hostname approximates random assignment.
Best-of-3 random is "close enough" to best-of-5 random (see the sketch below).
Sanity check: what if a single host crawls all targets?
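A minimal sketch of the best-of-k assignment heuristic these numbers suggest. The ping() argument is a hypothetical RTT probe, abstracted away here.

    import random

    def best_of_k(target, crawlers, ping, k=3):
        """Assign `target` to the closest of k randomly sampled crawlers.

        k=3 was "close enough" to k=5 in the PlanetLab measurements.
        """
        candidates = random.sample(crawlers, k)
        return min(candidates, key=lambda c: ping(c, target))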
Summary: the schemes (Hostname, Redirect, URL) are compared on load-balancing download bandwidth, DHT communication, network proximity, rate-limiting crawlees, and load-balancing DHT bandwidth.
Partition by URL: batching with ring-based forwarding. Experimented on 4 local machines.
Partition by hostname: forwards the crawl to the DHT neighbor closest to the crawlee (see the sketch below). Experimented on 12 local machines.
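A minimal sketch of the proximity-forwarding idea: hand the download to whichever known node is closest to the crawlee. The rtt() argument is a hypothetical node-to-crawlee latency estimate; the real system's mechanism is not specified here.

    def forward_crawl(url, self_node, neighbors, rtt):
        """Return the node that should download `url`.

        `neighbors` are this node's DHT routing-table entries; crawl
        locally when self_node is already the closest to the crawlee.
        """
        candidates = [self_node] + list(neighbors)
        return min(candidates, key=lambda node: rtt(node, url))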
Conclusions:
Propose a DHT- and QP-agnostic distributed web crawler.
Express a crawl as a query; permits user-customizable refinement of crawls.
Discover important trade-offs in distributed crawling: co-ordination comes with extra communication costs.
Deployment and experimentation on PlanetLab: examine crawl distribution strategies under different workloads on live web sources, and measure the potential benefits of network proximity.
Related work:
Google: a centralized dispatcher sends URLs to crawlers.
Hash-based parallel crawlers.
BINGO!: crawls the web given a basic training set.
Grub: built on SETI@Home infrastructure; 23,993 members.
Partition by hostname shows imbalance: some crawlers are over-utilized for downloads. Little difference in throughput as long as download threads are kept busy.
For load balance, URL is best, followed by Redirect and Hostname.
Essential for overlapping users.
A requirement of personalized crawls and of online relevance feedback.