Web Crawling
Introduction to Information Retrieval INF 141 Donald J. Patterson
Content adapted from Hinrich Schütze http://www.informationretrieval.org
Web Crawlers
Robust Crawling
[Figure: crawl architecture. Components: WWW, DNS, Fetch, Parse, Seen? (Doc. Fingerprints), URL Filter (Robots.txt), Duplicate Elimination (URL Index), URL Frontier Queue]
Parsing
Duplication
any further
Filters
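The URL Filter stage in the architecture is paired with Robots.txt. A minimal sketch of such a filter follows, assuming the standard-library `urllib.robotparser` module is an acceptable stand-in for the crawler's own robots.txt handling. The robots.txt body, the `ExampleBot` user agent, and the function name `make_robots_filter` are hypothetical and not from the slides.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would fetch and cache one per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robots_filter(robots_txt: str, user_agent: str = "ExampleBot"):
    """Return a predicate that says whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        # Assumption: only http(s) links are worth passing on to the frontier.
        if urlsplit(url).scheme not in ("http", "https"):
            return False
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_robots_filter(ROBOTS_TXT)
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
    "mailto:someone@example.com",
]
kept = [url for url in candidates if allowed(url)]
```

In this sketch one parser covers one host. A crawler that visits many hosts would need one cached parser per host.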
Duplicate Elimination
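The architecture has two separate "have we seen this?" checks: document fingerprints after parsing, and a URL index for duplicate elimination. The sketch below illustrates both under simplifying assumptions: an exact SHA-256 hash stands in for the document fingerprint (it catches only byte-identical copies, not near-duplicates), and an in-memory set stands in for the URL index. The names `SeenSets`, `fingerprint`, and `normalize` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def fingerprint(body: bytes) -> str:
    """Exact-content fingerprint (an assumption; not a near-duplicate detector)."""
    return hashlib.sha256(body).hexdigest()


def normalize(url: str) -> str:
    """Lower-case scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SeenSets:
    """In-memory stand-ins for the document fingerprints and the URL index."""

    def __init__(self) -> None:
        self.doc_fingerprints: set[str] = set()
        self.urls: set[str] = set()

    def is_new_document(self, body: bytes) -> bool:
        fp = fingerprint(body)
        if fp in self.doc_fingerprints:
            return False  # same content already processed
        self.doc_fingerprints.add(fp)
        return True

    def is_new_url(self, url: str) -> bool:
        key = normalize(url)
        if key in self.urls:
            return False  # already in the frontier or already crawled
        self.urls.add(key)
        return True


seen = SeenSets()
first = seen.is_new_url("HTTP://Example.com")
repeat = seen.is_new_url("http://example.com/#top")
```

Here `first` is true and `repeat` is false, because both URLs normalize to the same key.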
Distributing the crawl
cache locality
multiple nodes for robustness
[Figure: distributed crawl architecture. Components: WWW, DNS, Fetch, Parse, Seen? (Doc. Fingerprints), URL Filter (Robots.txt), Host Splitter (To Other Nodes / From Other Nodes), Duplicate Elimination (URL Index), URL Frontier Queue]
The output of the URL Filter at each node is sent to the Duplicate Eliminator at all other nodes
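One way to picture the Host Splitter is as a function from a URL's host to a node number. The sketch below assumes each host is owned by exactly one node, chosen by hashing the host name. That ownership rule, and the names `node_for_url` and `split_by_host`, are assumptions for illustration rather than something stated on the slide.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to the node assumed to own its host."""
    host = urlsplit(url).hostname or ""
    # hashlib gives the same answer on every node; the built-in hash() is
    # randomized per process, so it would not.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def split_by_host(urls, this_node: int, num_nodes: int):
    """Separate URLs kept locally from URLs forwarded to other nodes."""
    local = []
    remote = {}  # node number -> URLs to send to that node's duplicate eliminator
    for url in urls:
        node = node_for_url(url, num_nodes)
        if node == this_node:
            local.append(url)
        else:
            remote.setdefault(node, []).append(url)
    return local, remote


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
local, remote = split_by_host(urls, this_node=0, num_nodes=3)
```

Because the mapping depends only on the host, all URLs for one host land on the same node, which fits the cache-locality goal listed above.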
URL Frontier
creating bursty references
can hit it repeatedly
Magnitude of the crawl
Web Crawling Outline
URL Frontier Implementation - Mercator
[Figure: Mercator URL frontier. Components, top to bottom: Prioritizer, F "Front" Queues (1, 2, ..., F), Front Queue Selector, Back Queue Router (with Host to Back Queue Mapping Table), B "Back" Queues (1, 2, ..., B), Back Queue Selector (with Timing Heap)]
http://research.microsoft.com/~najork/mercator.pdf
URL Frontier Implementation - Mercator
priority
[Figure: front half of the frontier. Prioritizer, F "Front" Queues (1, 2, ..., F), Front Queue Selector]
URL Frontier Implementation - Mercator
initiated from back queues
[Figure: back half of the frontier. Back Queue Router (with Host to Back Queue Mapping Table), B "Back" Queues (1, 2, ..., B), Back Queue Selector (with Timing Heap)]
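The Back Queue Router and the Host to Back Queue Mapping Table can be sketched as a dictionary from host to queue index. The sketch assumes each back queue holds URLs for one host at a time and that a queue is released for reuse when it empties. The class name `BackQueues` and its methods are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class BackQueues:
    """B FIFO queues, each holding URLs for a single host (by assumption)."""

    def __init__(self, num_queues: int) -> None:
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}  # the mapping table: host -> back queue index

    def _free_queue(self):
        """Return the index of a queue not assigned to any host, or None."""
        assigned = set(self.host_to_queue.values())
        for i in range(len(self.queues)):
            if i not in assigned:
                return i
        return None

    def route(self, url: str) -> bool:
        """Put the URL in its host's queue. False means no queue is free."""
        host = urlsplit(url).hostname or ""
        idx = self.host_to_queue.get(host)
        if idx is None:
            idx = self._free_queue()
            if idx is None:
                return False  # caller keeps the URL in a front queue for now
            self.host_to_queue[host] = idx
        self.queues[idx].append(url)
        return True

    def pop(self, idx: int) -> str:
        """Take the head URL. Release the queue if it becomes empty."""
        url = self.queues[idx].popleft()
        if not self.queues[idx]:
            del self.host_to_queue[urlsplit(url).hostname or ""]
        return url


back = BackQueues(num_queues=2)
back.route("http://a.example/1")
back.route("http://a.example/2")
back.route("http://b.example/1")
```

With two queues, a third host cannot be routed until one of the first two queues drains.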
URL Frontier Implementation - Mercator
while crawling
queues (mapping) to help
URL Frontier Implementation - Mercator
be hit again
URL Frontier Implementation - Mercator
based on time, b.
it in that queue, repeat.
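The Timing Heap can be sketched with a min-heap keyed on the earliest time each host may be contacted again. The sketch below is a simulation under stated assumptions: a single crawler thread, a simulated clock instead of real sleeping, a fixed per-host delay, and no refill from the front queues when a back queue empties. The function name `crawl_order` and the `delay` parameter are hypothetical.

```python
import heapq
from collections import deque


def crawl_order(back_queues: dict, delay: float, start: float = 0.0):
    """Return (time, url) pairs in the order one thread would fetch them.

    back_queues maps host -> deque of URLs for that host and is consumed.
    """
    # One heap entry per host: (earliest time the host may be hit, host).
    heap = [(start, host) for host in sorted(back_queues) if back_queues[host]]
    heapq.heapify(heap)
    clock = start
    order = []
    while heap:
        ready_at, host = heapq.heappop(heap)  # root = host that is ready soonest
        clock = max(clock, ready_at)          # "wait" until that host is ready
        url = back_queues[host].popleft()     # fetch the URL at the queue head
        order.append((clock, url))
        if back_queues[host]:
            # Host still has work: it may be hit again only after the delay.
            heapq.heappush(heap, (clock + delay, host))
    return order


queues = {
    "a.example": deque(["http://a.example/1", "http://a.example/2"]),
    "b.example": deque(["http://b.example/1"]),
}
schedule = crawl_order(queues, delay=2.0)
```

The second `a.example` URL is scheduled at time 2.0, after `b.example` has been served, so no host is hit twice within the delay.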
URL Frontier Implementation - Mercator
as crawler threads
keep a portion in memory.