Web Crawling Introduction to Information Retrieval INF 141 Donald - - PowerPoint PPT Presentation

SLIDE 1

Web Crawling

Introduction to Information Retrieval INF 141 Donald J. Patterson

Content adapted from Hinrich Schütze http://www.informationretrieval.org

SLIDE 2

Web Crawlers

SLIDE 3

A Robust Crawl Architecture

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Duplicate Elimination (URL index) → URL Frontier queue]

SLIDE 4

Parsing: URL normalization

Parsing

  • When a fetched document is parsed, some of its outlink URLs are relative
  • For example:
  • http://en.wikipedia.org/wiki/Main_Page
  • has a link to “/wiki/Special:Statistics”
  • which is the same as
  • http://en.wikipedia.org/wiki/Special:Statistics
  • Parsing involves normalizing (expanding) relative URLs
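As a sketch of this normalization step, Python's standard `urljoin` performs exactly this expansion of a relative outlink against the URL of the page it was found on:

```python
from urllib.parse import urljoin

# Expand the relative outlink from the slide against its base page.
base = "http://en.wikipedia.org/wiki/Main_Page"
absolute = urljoin(base, "/wiki/Special:Statistics")
print(absolute)  # http://en.wikipedia.org/wiki/Special:Statistics
```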
SLIDE 5

A Robust Crawl Architecture

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Duplicate Elimination (URL index) → URL Frontier queue]

SLIDE 6

Content Seen?

Duplication

  • Duplication is widespread on the web
  • If a page just fetched is already in the index, don’t process it any further
  • This can be done using document fingerprints/shingles
  • A type of hashing scheme
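A minimal sketch of the shingling idea (function names and the shingle width are illustrative): hash every w-word shingle of a document and compare shingle sets; a high Jaccard overlap flags a near-duplicate.

```python
import hashlib

def shingles(text, w=4):
    """Return the set of hashed w-word shingles of a document."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode()).hexdigest()
        for i in range(max(1, len(words) - w + 1))
    }

def resemblance(a, b):
    """Jaccard overlap of two shingle sets; near 1.0 means near-duplicate."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Exact duplicates score 1.0; documents sharing no shingles score 0.0, and thresholds in between catch near-duplicates.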
SLIDE 7

A Robust Crawl Architecture

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Duplicate Elimination (URL index) → URL Frontier queue]

SLIDE 8

Compliance with webmasters’ wishes...

Filters

  • Robots.txt
  • A filter is a regular expression for URLs to be excluded
  • How often do you check robots.txt?
  • Cache it to avoid using bandwidth and loading the web server
  • Sitemaps
  • A mechanism to better manage the URL frontier
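A sketch of the caching idea using Python's standard robots.txt parser (the `allowed` helper and cache are illustrative, not from the slides): the file is parsed once per host and the cached rules answer every later URL check for that host.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Cache of parsed robots.txt rules, one entry per host, so the file is
# fetched and parsed once per host rather than once per URL.
_robots_cache = {}

def allowed(url, robots_lines, agent="MyCrawler"):
    """Check url against its host's robots.txt rules.

    robots_lines: the text lines of that host's robots.txt; in a real
    crawler they would come from an HTTP fetch of /robots.txt."""
    host = urlsplit(url).netloc
    if host not in _robots_cache:
        rp = RobotFileParser()
        rp.parse(robots_lines)
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)
```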
SLIDE 9

A Robust Crawl Architecture

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Duplicate Elimination (URL index) → URL Frontier queue]

SLIDE 10

Duplicate Elimination

  • For a one-time crawl
  • Test whether an extracted, parsed, filtered URL
  • has already been sent to the frontier.
  • has already been indexed.
  • For a continuous crawl
  • See full frontier implementation:
  • Update the URL’s priority
  • Based on staleness
  • Based on quality
  • Based on politeness
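For the one-time-crawl case, the test can be sketched with a set of already-seen URLs (names are illustrative), assuming URLs have already been normalized:

```python
# URLs already sent to the frontier or indexed; a web-scale crawler would
# use a compact structure (e.g. fingerprints) rather than a plain set.
seen_urls = set()

def submit(url, frontier):
    """Send url to the frontier only if it has not been seen before."""
    if url not in seen_urls:
        seen_urls.add(url)
        frontier.append(url)

frontier = []
submit("http://example.com/a", frontier)
submit("http://example.com/a", frontier)  # duplicate: dropped
```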
SLIDE 11

Distributing the crawl

  • The key goal for the architecture of a distributed crawl is cache locality
  • We want multiple crawl threads in multiple processes at multiple nodes for robustness
  • Geographically distributed for speed
  • Partition the hosts being crawled across nodes
  • A hash function is typically used for the partition
  • How do the nodes communicate?
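The host-hash partition can be sketched as follows (the helper name is illustrative): hashing the host rather than the full URL keeps every URL of one host on the same node, which preserves per-host politeness and DNS-cache locality.

```python
import hashlib
from urllib.parse import urlsplit

def node_for(url, num_nodes):
    """Assign a URL to a crawl node by hashing its host, so all URLs on
    one host land on the same node."""
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_nodes
```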
SLIDE 12

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Host Splitter (to/from other nodes) → Duplicate Elimination (URL index) → URL Frontier queue]

The output of the URL Filter at each node is sent to the Duplicate Eliminator at all other nodes

SLIDE 13

URL Frontier

  • Freshness
  • Crawl some pages more often than others
  • Keep track of change rate of sites
  • Incorporate sitemap info
  • Quality
  • High quality pages should be prioritized
  • Based on link-analysis, popularity, heuristics on content
  • Politeness
  • When was the last time you hit a server?
SLIDE 14

URL Frontier

  • Freshness, Quality and Politeness
  • These goals will conflict with each other
  • A simple priority queue will fail because links are bursty
  • Many sites have lots of links pointing to themselves, creating bursty references
  • Time influences the priority
  • Politeness challenges
  • Even if only one thread is assigned to hit a particular host, it can hit it repeatedly
  • Heuristic: insert a time gap between successive requests
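The time-gap heuristic can be sketched as a per-host gate (names and the 2-second gap are illustrative): a fetch of a host is allowed only if enough time has passed since the last request to it.

```python
import time

MIN_GAP = 2.0        # illustrative gap between successive hits on one host
_last_hit = {}       # host -> time of the most recent permitted fetch

def may_fetch(host, now=None):
    """Return True (and record the hit) iff the politeness gap for this
    host has elapsed; otherwise the caller must wait or pick another host."""
    now = time.monotonic() if now is None else now
    if now - _last_hit.get(host, float("-inf")) >= MIN_GAP:
        _last_hit[host] = now
        return True
    return False
```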
SLIDE 15

Magnitude of the crawl

  • To fetch 1,000,000,000 pages in one month...
  • a small fraction of the web
  • ...we need to fetch 400 pages per second!
  • Since many fetches will be duplicates, unfetchable, filtered, etc., 400 pages per second isn’t fast enough
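The arithmetic behind the 400-pages-per-second figure, taking a month as 30 days:

```python
pages = 1_000_000_000
seconds_per_month = 30 * 24 * 3600   # 2,592,000 seconds
rate = pages / seconds_per_month
print(round(rate))  # 386 -- the slide rounds this up to 400
```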
SLIDE 16
Overview

Web Crawling Outline

  • Introduction
  • URL Frontier
  • Robust Crawling
  • DNS
  • Various parts of architecture
  • URL Frontier
  • Index
  • Distributed Indices
  • Connectivity Servers

SLIDE 17

Robust Crawling

[Diagram: WWW → DNS → Fetch → Parse → Content Seen? (doc. fingerprints) → URL Filter (robots.txt) → Host Splitter (to/from other nodes) → Duplicate Elimination (URL index) → URL Frontier queue]

The output of the URL Filter at each node is sent to the Duplicate Eliminator at all other nodes

SLIDE 18

URL Frontier Implementation - Mercator

  • URLs flow from top to bottom
  • Front queues manage priority
  • Back queues manage politeness
  • Each queue is FIFO

[Diagram: Prioritizer → F “Front” Queues (1..F) → Front Queue Selector → Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]

http://research.microsoft.com/~najork/mercator.pdf

SLIDE 19

URL Frontier Implementation - Mercator

  • The prioritizer takes URLs and assigns a priority
  • An integer between 1 and F
  • Appends the URL to the appropriate queue
  • Priority
  • Based on rate of change
  • Based on quality (spam)
  • Based on application

[Diagram detail, front queues: Prioritizer → F “Front” Queues (1..F) → Front Queue Selector]
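The prioritizer step can be sketched as below; the scoring weights and function names are illustrative, standing in for whatever change-rate, quality, and application signals a real crawler combines.

```python
F = 5  # number of front queues; priority 1 is highest

def prioritize(change_rate, quality):
    """Map change-rate and quality estimates in [0, 1] to a front-queue
    index 1..F. The 0.7/0.3 weights are illustrative."""
    score = 0.7 * change_rate + 0.3 * quality
    return max(1, F - int(score * F))

# One FIFO front queue per priority level.
front_queues = {i: [] for i in range(1, F + 1)}

def enqueue(url, change_rate, quality):
    """Append the URL to the front queue matching its priority."""
    front_queues[prioritize(change_rate, quality)].append(url)
```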

SLIDE 20

URL Frontier Implementation - Mercator

  • Selection from the front queues is initiated by the back queues
  • Pick a front queue, how?
  • Round robin
  • Randomly
  • Monte Carlo
  • Biased toward high priority

[Diagram detail, back queues: Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]
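One possible realization of the “biased toward high priority” policy (the weighting scheme is an assumption, not from the slides): pick among the non-empty front queues at random, giving lower-numbered (higher-priority) queues larger weights.

```python
import random

def pick_front_queue(front_queues, rng=random):
    """Pick a non-empty front queue at random, biased toward high
    priority: queue 1 gets the largest weight."""
    candidates = [q for q, urls in sorted(front_queues.items()) if urls]
    weights = [1.0 / q for q in candidates]   # higher priority, higher weight
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Round robin or uniform random are the simpler alternatives the slide lists; only the weight vector changes.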

SLIDE 21

URL Frontier Implementation - Mercator

  • Each back queue is kept non-empty while crawling
  • Each back queue has URLs from one host only
  • Maintain a table mapping each host to its back queue to help

[Diagram detail, back queues: Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]

SLIDE 22

URL Frontier Implementation - Mercator

  • Timing Heap
  • One entry per queue
  • Holds the earliest time that a host can be hit again
  • The earliest time is based on
  • The last access to that host
  • Plus any appropriate heuristic

[Diagram detail, back queues: Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]
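The timing heap can be sketched with Python's `heapq` (function names and the fixed politeness gap are illustrative): each entry pairs a back queue with the earliest time its host may be hit again, so the root is always the next queue to become eligible.

```python
import heapq

timing_heap = []   # entries: (earliest next-access time, back-queue id)

def release(queue_id, last_access, gap=2.0):
    """After fetching from a back queue, re-insert it with its host's
    next allowed access time: last access plus a politeness gap."""
    heapq.heappush(timing_heap, (last_access + gap, queue_id))

def next_ready(now):
    """Return the queue whose wait expires soonest, if it is ready now."""
    if timing_heap and timing_heap[0][0] <= now:
        return heapq.heappop(timing_heap)[1]
    return None
```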

SLIDE 23

URL Frontier Implementation - Mercator

  • A crawler thread needs a URL
  • It checks the root of the timing heap
  • It gets the next eligible queue, b, based on time
  • It gets a URL from b
  • If b is empty
  • Pull a URL v from a front queue
  • If a back queue for v’s host exists, place v in that queue and repeat
  • Else add v to b and update the heap

[Diagram detail, back queues: Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]
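The refill step above can be sketched as follows, with plain dicts and lists standing in for the real queue structures (all names here are illustrative): URLs for hosts that already own a back queue are routed there, and the first URL from an unassigned host claims the empty queue b.

```python
from urllib.parse import urlsplit

def refill(b, back_queues, host_to_queue, front_pop):
    """Refill empty back queue b from the front queues.

    front_pop() returns the next front-queue URL, or None when exhausted.
    host_to_queue is the host-to-back-queue mapping table."""
    while True:
        v = front_pop()
        if v is None:
            return
        host = urlsplit(v).netloc
        if host in host_to_queue:
            # Host already has a back queue: route v there and keep pulling.
            back_queues[host_to_queue[host]].append(v)
        else:
            # Unassigned host: claim queue b for it and stop.
            host_to_queue[host] = b
            back_queues[b].append(v)
            return
```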

SLIDE 24

URL Frontier Implementation - Mercator

  • How many queues?
  • Keep all threads busy
  • ~3 times as many back queues as crawler threads
  • Web-scale issues
  • This won’t fit in memory
  • Solution
  • Keep queues on disk and keep a portion in memory

[Diagram detail, back queues: Back Queue Router, with host-to-back-queue mapping table → B “Back” Queues (1..B) → Back Queue Selector, driven by the Timing Heap]

SLIDE 25