Scalable, Generic, and Adaptive Systems for Focused Crawling

Mar 09, 2023

SLIDE 1

Scalable, Generic, and Adaptive Systems for Focused Crawling

Georges Gouriten* (georges@netiru.fr)
Silviu Maniu°
Pierre Senellart*°

* Télécom Paristech – Institut Mines-Télécom – LTCI CNRS
° Hong Kong University

SLIDE 2

What is focused crawling?

SLIDE 3

A directed graph

SLIDE 4

Web, social network, P2P, etc.

SLIDE 5

Weighted

(Figure: the graph with node weights 3, 5, 4, 3, 5, 3, 4, 2, 2, 3)

SLIDE 6

Let u be a node, β(u) = count of the word Bhutan in all the tweets of u

SLIDE 7

Even more weighted

(Figure: the graph with edge weights 2, 3, 1, 1, 1, 3)

SLIDE 8

Let (u, v) be an edge, α(u, v) = count of the word Bhutan in all the tweets of u mentioning v

SLIDE 9

(Figure: the graph with its node and edge weights)

The total graph

SLIDE 10

(Figure: the weighted graph)

A seed list

SLIDE 11

The frontier

(Figure: the weighted graph)

SLIDE 12

Crawling one node

(Figure: the weighted graph)

SLIDE 13

A crawl sequence

Let V0 be the seed list, a set of nodes. A crawl sequence starting from V0 is a sequence of nodes (v_i) such that, for every i,

v_i ∈ frontier(V0 ∪ {v_0, v_1, ..., v_{i-1}})
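The definition can be checked mechanically. The sketch below is a minimal illustration, assuming that the frontier of a crawled set is the set of out-neighbors of crawled nodes that have not been crawled yet (the slides show the frontier only as a picture). The names `frontier`, `is_crawl_sequence`, and `toy_graph` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

Graph = Dict[str, List[str]]  # node -> nodes it points to


def frontier(graph: Graph, crawled: Iterable[str]) -> Set[str]:
    """Assumed definition: out-neighbors of crawled nodes, minus crawled nodes."""
    crawled_set = set(crawled)
    reached = {v for u in crawled_set for v in graph.get(u, [])}
    return reached - crawled_set


def is_crawl_sequence(graph: Graph, seeds: Iterable[str], sequence: List[str]) -> bool:
    """True if every v_i lies in the frontier of the seeds plus v_0 .. v_{i-1}."""
    crawled = set(seeds)
    for node in sequence:
        if node not in frontier(graph, crawled):
            return False
        crawled.add(node)
    return True


# Hypothetical toy graph, not the one drawn on the slides.
toy_graph: Graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
```

With seed list {a}, the sequence b, d, c is a valid crawl sequence, while a sequence starting with d is not, because d only enters the frontier once b has been crawled.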

SLIDE 14

Goal of a focused crawler

Produce crawl sequences with global scores (sum) as high as possible

SLIDE 15

A high-level algorithm

Estimate scores at the frontier
Pick a node from the frontier
Crawl the node
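These three steps can be sketched as a loop. This is an illustrative sketch, not the authors' implementation: the function `focused_crawl`, its parameters, and the assumed frontier definition (uncrawled out-neighbors of crawled nodes) are all hypothetical. The estimator is passed in as a function, so a perfect estimator is simply one that returns the true score.

```python
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # node -> nodes it points to


def focused_crawl(
    graph: Graph,
    seeds: Set[str],
    score: Dict[str, float],
    estimate: Callable[[str, Set[str]], float],
    budget: int,
) -> Tuple[List[str], float]:
    """Run `budget` crawl steps; return the crawl sequence and its summed score."""
    crawled = set(seeds)
    sequence: List[str] = []
    total = 0.0
    for _ in range(budget):
        # Assumed frontier: uncrawled out-neighbors of crawled nodes.
        candidates = {v for u in crawled for v in graph.get(u, [])} - crawled
        if not candidates:
            break
        # Steps 1 and 2: estimate every frontier node, pick the best estimate.
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda v: estimate(v, crawled))
        # Step 3: crawl the node; its true score is only known now.
        crawled.add(best)
        sequence.append(best)
        total += score[best]
    return sequence, total


# Hypothetical toy data and a perfect estimator (returns the true score).
toy_graph: Graph = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
toy_score = {"s": 0.0, "a": 1.0, "b": 5.0, "c": 9.0}
oracle = lambda v, crawled: toy_score[v]
```

Calling `focused_crawl(toy_graph, {"s"}, toy_score, oracle, 3)` crawls b, then a, then c, for a total score of 15.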

SLIDE 16

Supposing a perfect estimator

SLIDE 17

Finding an optimal crawl sequence offline: NP-hard
Greedy wins for a crawled graph of more than 1,000 nodes
A refresh rate of 1 is better

SLIDE 18

Estimation in practice

SLIDE 19

Different kinds of estimators

SLIDE 20

bfs

(Figure: the graph with node weights)

SLIDE 21

bfs

(Figure: the graph with node weights)

SLIDE 22

bfs

SLIDE 23

nr

navigational rank
score propagation from the ancestors of a node, then to the children of a node

SLIDE 24

nr

SLIDE 25

pic

online page importance computation

~ online PageRank computation

SLIDE 26

pic

SLIDE 27

Open spaces in the state-of-the-art

nr has quadratic complexity

pic focuses on popularity

the rest is about how to score

SLIDE 28

First-level neighborhood

SLIDE 29

Second-level neighborhood

SLIDE 30

Neighborhood-based estimators

SLIDE 31

deg, e, n, ne

deg: number of neighbors
e: sum of incoming edges
n: sum of incoming nodes
ne: sum of incoming (node*edge)s
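The four features can be computed from the already-crawled part of the graph. The sketch below is one reading of the definitions, assuming that "incoming" refers to crawled nodes u with an edge (u, v) to the frontier node v, that edges carry the score α(u, v), and that nodes carry the score β(u). The function name and data layout are hypothetical.

```python
from typing import Dict, Tuple


def neighborhood_estimates(
    target: str,
    crawled_edges: Dict[Tuple[str, str], float],  # (u, v) -> alpha(u, v), u crawled
    beta: Dict[str, float],                        # u -> beta(u), u crawled
) -> Dict[str, float]:
    """Compute deg, e, n, ne for one frontier node from its crawled in-neighbors."""
    incoming = [(u, a) for (u, v), a in crawled_edges.items() if v == target]
    return {
        "deg": len(incoming),                        # number of known neighbors
        "e": sum(a for _, a in incoming),            # sum of incoming edge scores
        "n": sum(beta[u] for u, _ in incoming),      # sum of incoming node scores
        "ne": sum(beta[u] * a for u, a in incoming), # sum of node*edge products
    }


# Hypothetical example: two crawled nodes pointing to frontier node "x".
example_beta = {"u1": 3.0, "u2": 5.0}
example_edges = {("u1", "x"): 2.0, ("u2", "x"): 1.0, ("u1", "y"): 4.0}
```

For node x this gives deg = 2, e = 3, n = 8, and ne = 3×2 + 5×1 = 11.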

SLIDE 32

Linear regressions

SLIDE 33

Multi-armed bandits (1)

slot machine 1 slot machine 2 slot machine 3 slot machine 4 ...

SLIDE 34

Multi-armed bandits (2)

Budget n, how to maximize the reward?
Balance exploration and exploitation

SLIDE 35

Applied to focused crawling

Slot machines: estimators
Reward: score of the top node

SLIDE 36

mab_ε

probability 1-ε: slot machine with the highest average reward
probability ε: random slot machine
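This is the standard ε-greedy rule. A minimal sketch follows, where each arm stands for one estimator and the reward is whatever the crawler observes after following that estimator's top node. The class name and the choice to treat an untried arm as having average reward 0 are assumptions made for the example.

```python
import random
from typing import List, Optional


class EpsilonGreedy:
    """Epsilon-greedy choice among a fixed number of arms (slot machines)."""

    def __init__(self, n_arms: int, epsilon: float, rng: Optional[random.Random] = None):
        self.epsilon = epsilon
        self.counts: List[int] = [0] * n_arms
        self.totals: List[float] = [0.0] * n_arms
        self.rng = rng or random.Random()

    def average(self, arm: int) -> float:
        # Assumption: an arm that was never played has average reward 0.
        return self.totals[arm] / self.counts[arm] if self.counts[arm] else 0.0

    def select(self) -> int:
        # With probability epsilon, explore a uniformly random arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Otherwise exploit the arm with the highest average reward so far.
        return max(range(len(self.counts)), key=self.average)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward
```

A crawler would call `select()` to choose an estimator at each step, crawl that estimator's top node, and pass the node's score to `update()`.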

SLIDE 37

mab_ε-first

steps [0, ⌊ε × N⌋]: random slot machine
steps [⌊ε × N⌋ + 1, N]: slot machine with the highest average reward
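The ε-first rule differs from ε-greedy only in when it explores: all exploration happens up front. A minimal sketch, assuming the caller keeps track of the average reward of each slot machine; the function name and signature are hypothetical.

```python
import math
import random
from typing import Sequence


def epsilon_first_choice(
    step: int,
    total_steps: int,
    epsilon: float,
    averages: Sequence[float],
    rng: random.Random,
) -> int:
    """Return the index of the slot machine to play at `step` (0-based)."""
    exploration_end = math.floor(epsilon * total_steps)
    if step <= exploration_end:
        # Exploration phase: pick a slot machine uniformly at random.
        return rng.randrange(len(averages))
    # Exploitation phase: pick the highest average reward seen so far.
    return max(range(len(averages)), key=lambda arm: averages[arm])
```

With ε = 0.5 and N = 10, steps 0 to 5 are random and steps 6 to 10 always play the current best slot machine.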

SLIDE 38

mab_var

Succession of ε-first strategies, with a reset every r steps, r varying with the context

SLIDE 39

Their running times

SLIDE 40

Expected running times

Twitter API for one week:

3 s per node
about 200,000 nodes

One domain website for one week:

1 s per node
about 600,000 nodes
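The node counts are consistent with dividing one week by the time spent per node. The arithmetic below assumes that reading (3 s and 1 s are per-node delays); the function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds


def nodes_per_week(seconds_per_node: int) -> int:
    """Number of nodes that can be crawled in one week at a fixed per-node delay."""
    return SECONDS_PER_WEEK // seconds_per_node


twitter_nodes = nodes_per_week(3)  # 201,600, rounded to 200,000 on the slide
website_nodes = nodes_per_week(1)  # 604,800, rounded to 600,000 on the slide
```

Under this reading, an estimator has to produce its estimates within roughly 3 s (Twitter) or 1 s (single website) per step to avoid slowing the crawl down.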

SLIDE 41

Experimental framework (1)

SLIDE 42

Experimental framework (2)

─ Graph score
10 seed graphs
1 seed graph: 50 seeds picked randomly among non-zero β
Arithmetic average of the crawl scores (sum)

─ Global score
Normalization with a baseline: relative score
Geometric average among the five graphs
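The two aggregation levels can be written down directly. This sketch assumes that "normalization with a baseline" means dividing each graph score by the baseline's score on the same graph; the slides do not spell this out. Function names are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def graph_score(crawl_scores: Sequence[float]) -> float:
    """Arithmetic average of the crawl scores obtained from the seed graphs."""
    return mean(crawl_scores)


def global_score(graph_scores: Sequence[float], baseline_scores: Sequence[float]) -> float:
    """Geometric average of the per-graph scores relative to a baseline."""
    # Assumed normalization: ratio to the baseline on the same graph.
    relative = [s / b for s, b in zip(graph_scores, baseline_scores)]
    # Geometric mean computed through logarithms; requires positive ratios.
    return math.exp(mean([math.log(r) for r in relative]))
```

For example, relative scores of 2 and 8 on two graphs give a global score of 4. The geometric average keeps one graph with very large scores from dominating the comparison.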

SLIDE 43

Datasets and code are online

http://netiru.fr/research/14fc

SLIDE 44

To measure the running times

Same crawl sequence: the oracle
Storage in RAM (20G)
3.6 GHz

SLIDE 45

The running times (ms)

SLIDE 46

nr

Quadratic complexity, with large constant factors

SLIDE 47

Their precision

SLIDE 48

The precision

Same crawl sequence: the oracle
Precision: distance of the top node to the actual top node
Arithmetically averaged over a window of 1000 steps

SLIDE 49

For bretagne

SLIDE 50

Their ability to lead crawls

SLIDE 51

Leading the crawl

Different crawl sequences: defined by the top estimated nodes

SLIDE 52

Average graph scores for France

SLIDE 53

The multi-armed bandits

SLIDE 54

All the estimators

SLIDE 55

Conclusion

SLIDE 56

What we learnt

Generic model
NP-hardness offline
Refresh rate of 1
Greedy
Neighborhood features
Linear regressions
Multi-armed bandit strategy

SLIDE 57

Future work

Approximation of the optimal score
Distributed crawl
Recrawling nodes
Further multi-armed bandits comparisons

SLIDE 58

Thank you.

georges@netiru.fr

SLIDE 59

Finding the optimal crawl sequences in a known graph

SLIDE 60

PTime many-one reduction from the LST-Graph problem
Problem remains hard if nodes, not edges, are weighted
A subtree rooted at r is seen as a crawl sequence starting from r
Free edges are added to the graph to allow free crawls from the seed to any potential root of a subtree

SLIDE 61

Rich friends will make you richer

SLIDE 62

The greedy strategy

Node picked = argmax(β(v)), v in frontier

SLIDE 63

Is not always optimal

(Figure: example graph with node weights 2, 1, 2, 3, 4, 20, 12)
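A small example makes the point concrete. The graph below is a hypothetical one, not the graph drawn on the slide: a high-score node hides behind a low-score node, so greedy misses it within the budget. The frontier is assumed to be the uncrawled out-neighbors of crawled nodes, and the optimum is found by brute force, which is only feasible for tiny graphs.

```python
from typing import Dict, FrozenSet, List, Set

Graph = Dict[str, List[str]]


def frontier(graph: Graph, crawled: FrozenSet[str]) -> Set[str]:
    """Assumed definition: uncrawled out-neighbors of crawled nodes."""
    return {v for u in crawled for v in graph.get(u, [])} - crawled


def greedy_total(graph: Graph, score: Dict[str, float], seeds: Set[str], budget: int) -> float:
    """Total score when always crawling the frontier node with the highest score."""
    crawled, total = frozenset(seeds), 0.0
    for _ in range(budget):
        options = frontier(graph, crawled)
        if not options:
            break
        best = max(sorted(options), key=lambda v: score[v])
        crawled, total = crawled | {best}, total + score[best]
    return total


def best_total(graph: Graph, score: Dict[str, float], crawled: FrozenSet[str], budget: int) -> float:
    """Best achievable total score, by exhaustive search over crawl sequences."""
    options = frontier(graph, crawled)
    if budget == 0 or not options:
        return 0.0
    return max(score[v] + best_total(graph, score, crawled | {v}, budget - 1) for v in options)


# Hypothetical graph: s -> a (3), s -> b (2), a -> d (1), b -> c (20).
g: Graph = {"s": ["a", "b"], "a": ["d"], "b": ["c"], "c": [], "d": []}
w = {"s": 0.0, "a": 3.0, "b": 2.0, "c": 20.0, "d": 1.0}
```

With a budget of two crawls from seed s, greedy crawls a then b for a total of 5, while crawling b then c yields 22.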

SLIDE 64

The altered greedy strategy

Node picked =
probability q: argmax(β(v))
probability 1-q: random v so that max(β(u)) - β(v) <= ζ × max(β(u))
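The rule can be sketched as follows, assuming the maximum is taken over the frontier and that the random node is drawn uniformly among the nodes that satisfy the inequality. The function name and argument layout are hypothetical.

```python
import random
from typing import Dict


def altered_greedy_pick(
    frontier_scores: Dict[str, float],  # frontier node -> beta
    q: float,
    zeta: float,
    rng: random.Random,
) -> str:
    """Pick the top node with probability q, else a random near-top node."""
    # Sorting first makes tie-breaking deterministic.
    best_node = max(sorted(frontier_scores), key=lambda v: frontier_scores[v])
    best = frontier_scores[best_node]
    if rng.random() < q:
        return best_node
    # Nodes whose score is within a fraction zeta of the best score.
    # The best node always qualifies, so the list is never empty.
    close = sorted(v for v, s in frontier_scores.items() if best - s <= zeta * best)
    return rng.choice(close)
```

With scores a = 10, b = 9.5, c = 1 and ζ = 0.1, the random branch can return a or b but never c.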

SLIDE 65

Altered greedy vs greedy for jazz

SLIDE 66

The refresh rate disadvantage

SLIDE 67

When estimation takes too long

SLIDE 68