Scalable, Generic, and Adaptive Systems for Focused Crawling



slide-1
SLIDE 1

Scalable, Generic, and Adaptive Systems for Focused Crawling

Georges Gouriten* - georges@netiru.fr Silviu Maniu° Pierre Senellart*°

* Télécom Paristech – Institut Mines-Télécom – LTCI CNRS ° Hong Kong University

slide-2
SLIDE 2

What is focused crawling?

slide-3
SLIDE 3

A directed graph

slide-4
SLIDE 4

Web Social network P2P etc.

slide-5
SLIDE 5

Weighted

[Figure: the example graph, with a weight on each node]

slide-6
SLIDE 6

Let u be a node, β(u) = count of the word Bhutan in all the tweets of u

slide-7
SLIDE 7

Even more weighted

[Figure: the example graph, with a weight on each edge]

slide-8
SLIDE 8

Let (u, v) be an edge, α(u, v) = count of the word Bhutan in all the tweets of u mentioning v
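A minimal sketch of the two weights, assuming each tweet is available as an author, a text, and the list of accounts it mentions. The `Tweet` structure, the tokenization, and the function names are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    # Hypothetical record layout, for illustration only.
    author: str
    text: str
    mentions: list = field(default_factory=list)


def count_word(text, word):
    # Case-insensitive count of whitespace-separated tokens equal to `word`,
    # after stripping common punctuation from both ends of each token.
    target = word.lower()
    return sum(1 for tok in text.lower().split()
               if tok.strip(".,!?#@:;\"'") == target)


def beta(u, tweets, word="Bhutan"):
    # Node weight: occurrences of `word` in all the tweets of u.
    return sum(count_word(t.text, word) for t in tweets if t.author == u)


def alpha(u, v, tweets, word="Bhutan"):
    # Edge weight: occurrences of `word` in the tweets of u that mention v.
    return sum(count_word(t.text, word) for t in tweets
               if t.author == u and v in t.mentions)
```

Any other topic-relevance measure could replace the word count; the model only needs one number per node and one per edge.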

slide-9
SLIDE 9

The total graph

[Figure: the example graph, with node and edge weights]

slide-10
SLIDE 10

A seed list

[Figure: a seed list on the example graph]

slide-11
SLIDE 11

The frontier

[Figure: the frontier on the example graph]

slide-12
SLIDE 12

Crawling one node

[Figure: crawling one node of the example graph]

slide-13
SLIDE 13

A crawl sequence

Let V0 be the seed list (a set of nodes). A crawl sequence starting from V0 is a sequence (v0, v1, ..., vn) such that, for every i,

vi ∈ frontier(V0 ∪ {v0, v1, ..., vi-1})
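The definition can be checked mechanically. The sketch below assumes the graph is a dictionary mapping each node to its successors, and that the frontier of a set of nodes is the set of uncrawled nodes pointed to by at least one node of that set; the slides show the frontier only as a figure, so this reading is an assumption.

```python
def frontier(graph, crawled):
    # Uncrawled nodes reachable in one hop from an already crawled node.
    return {v for u in crawled for v in graph.get(u, ()) if v not in crawled}


def is_crawl_sequence(graph, seeds, sequence):
    # Each v_i must belong to frontier(V0 ∪ {v_0, ..., v_{i-1}}).
    crawled = set(seeds)
    for v in sequence:
        if v not in frontier(graph, crawled):
            return False
        crawled.add(v)
    return True
```

For instance, with `graph = {"s": ["a", "b"], "a": ["c"]}` and seed list `{"s"}`, the sequence `["a", "c", "b"]` is valid, while `["c"]` is not, because `c` only enters the frontier once `a` has been crawled.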

slide-14
SLIDE 14

Goal of a focused crawler

Produce crawl sequences with global scores (sum) as high as possible

slide-15
SLIDE 15

A high-level algorithm

  • Estimate scores at the frontier
  • Pick a node from the frontier
  • Crawl the node
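The three steps can be sketched as a loop. This is an illustration under assumptions: the graph is a dictionary of successor lists, `estimate` is any caller-supplied scoring function, and the node picked is the one with the highest estimate (the slide only says "pick a node"). All names are hypothetical.

```python
def focused_crawl(graph, seeds, estimate, budget):
    crawled = set(seeds)
    sequence = []
    for _ in range(budget):
        # Frontier: uncrawled successors of crawled nodes.
        candidates = {v for u in crawled for v in graph.get(u, ())
                      if v not in crawled}
        if not candidates:
            break
        # Steps 1 and 2: estimate every frontier node, pick the best one.
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda v: estimate(v, crawled))
        # Step 3: crawl the node, which may extend the frontier.
        crawled.add(best)
        sequence.append(best)
    return sequence
```

Here the estimates are recomputed at every step; recomputing them less often trades estimation cost against precision.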

slide-16
SLIDE 16

Supposing a perfect estimator

slide-17
SLIDE 17

  • Finding an optimal crawl sequence offline: NP-hard
  • Greedy wins for crawled graphs of more than 1,000 nodes
  • A refresh rate of 1 is better

slide-18
SLIDE 18

Estimation in practice

slide-19
SLIDE 19

Different kinds of estimators

slide-20
SLIDE 20

bfs

[Figure: bfs on the example graph]

slide-21
SLIDE 21

bfs

[Figure: bfs on the example graph, next step]

slide-22
SLIDE 22

bfs

slide-23
SLIDE 23

nr

navigational rank: score propagation from the ancestors of a node, then to the children of a node

slide-24
SLIDE 24

nr

slide-25
SLIDE 25

opic

online page importance computation

~ online PageRank computation

slide-26
SLIDE 26

opic

slide-27
SLIDE 27

Open spaces in the state-of-the-art

nr has quadratic complexity

opic focuses on popularity

the rest is about how to score

slide-28
SLIDE 28

First-level neighborhood

slide-29
SLIDE 29

Second-level neighborhood

slide-30
SLIDE 30

Neighborhood-based estimators

slide-31
SLIDE 31

deg, e, n, ne

  • deg: number of neighbors
  • e: sum of the scores of the incoming edges
  • n: sum of the scores of the incoming nodes
  • ne: sum of the (node score × edge score) products over the incoming edges
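A minimal sketch of the four estimators for a frontier node v. It assumes that "neighbors" means the already crawled nodes with an edge to v, that node scores are β values and edge scores are α values, and that these are stored in plain dictionaries; the function and parameter names are hypothetical.

```python
def neighborhood_estimates(v, crawled_preds, beta, alpha):
    # crawled_preds: frontier node -> list of crawled nodes u with an edge (u, v)
    # beta: crawled node -> node score; alpha: (u, v) -> edge score
    preds = crawled_preds.get(v, [])
    return {
        "deg": len(preds),
        "e": sum(alpha[(u, v)] for u in preds),
        "n": sum(beta[u] for u in preds),
        "ne": sum(beta[u] * alpha[(u, v)] for u in preds),
    }
```

All four values only need one pass over the incoming edges of v, so they stay cheap to refresh when a new node is crawled.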

slide-32
SLIDE 32

Linear regressions

slide-33
SLIDE 33

Multi-armed bandits (1)

slot machine 1 slot machine 2 slot machine 3 slot machine 4 ...

slide-34
SLIDE 34

Multi-armed bandits (2)

Budget n, how to maximize the reward?

Balance exploration and exploitation

slide-35
SLIDE 35

Applied to focused crawling

  • Slot machines: estimators
  • Reward: score of the top node

slide-36
SLIDE 36

mab_ε

  • probability 1-ε: slot machine with the highest average reward
  • probability ε: random slot machine
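A minimal sketch of this ε-greedy rule, where each arm stands for one estimator. The class name, the choice of 0.0 as the average of an arm that has never been played, and the injectable random generator are assumptions made for the illustration.

```python
import random


class EpsilonGreedy:
    def __init__(self, n_arms, epsilon, rng=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # times each arm was played
        self.totals = [0.0] * n_arms  # cumulated reward of each arm
        self.rng = rng or random.Random()

    def average(self, arm):
        # Assumption: an arm that was never played has average reward 0.0.
        return self.totals[arm] / self.counts[arm] if self.counts[arm] else 0.0

    def select(self):
        # With probability epsilon: explore a random arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Otherwise: exploit the arm with the highest average reward.
        return max(range(len(self.counts)), key=self.average)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward
```

In the crawling setting, `select()` would choose which estimator ranks the frontier at the current step, and `update()` would receive the actual score of the node that estimator put on top.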

slide-37
SLIDE 37

mab_ε-first

  • steps [0, ⌊ε × N⌋]: random slot machine
  • steps [⌊ε × N⌋ + 1, N]: slot machine with the highest average reward
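The ε-first rule can be sketched as a single selection function. It assumes the caller keeps the current average reward of each slot machine in a list; the function name and its parameters are hypothetical.

```python
import math
import random


def epsilon_first_select(step, total_steps, epsilon, averages, rng=random):
    # Exploration phase: steps 0 .. floor(epsilon * N), random slot machine.
    if step <= math.floor(epsilon * total_steps):
        return rng.randrange(len(averages))
    # Exploitation phase: slot machine with the highest average reward.
    return max(range(len(averages)), key=lambda arm: averages[arm])
```

Unlike ε-greedy, all the exploration happens up front, so the quality of the later choices depends entirely on what was observed during the first ⌊ε × N⌋ steps.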

slide-38
SLIDE 38

mab_var

Succession of ε-first strategies, with a reset every r steps, r varying with the context

slide-39
SLIDE 39

Their running times

slide-40
SLIDE 40

Expected running times

Twitter API for one week:

  • 3 s per node
  • about 200,000 nodes

One domain website for one week:

  • 1 s per node
  • about 600,000 nodes
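The node counts follow from simple arithmetic, assuming the 3 s and 1 s figures are the time available per crawled node (an interpretation that matches the counts on the slide):

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds


def nodes_per_week(seconds_per_node):
    # Number of nodes crawled in one week at a fixed time per node.
    return SECONDS_PER_WEEK // seconds_per_node
```

`nodes_per_week(3)` gives 201,600 and `nodes_per_week(1)` gives 604,800, which round to the 200,000 and 600,000 quoted above. An estimator therefore has to run well within these per-node times to avoid slowing the crawl down.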
slide-41
SLIDE 41

Experimental framework (1)

slide-42
SLIDE 42

Experimental framework (2)

Graph score:

  • 10 seed graphs
  • 1 seed graph: 50 seeds picked randomly among non-zero β
  • Arithmetic average of the crawl scores (sum)

Global score:

  • Normalization with a baseline: relative score
  • Geometric average among the five graphs
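A minimal sketch of the two aggregation levels. It assumes that normalization means dividing each graph score by the baseline's score on the same graph, and that all scores are strictly positive; the function names are hypothetical.

```python
import math


def graph_score(crawl_scores):
    # Arithmetic average of the crawl scores obtained from the seed graphs.
    return sum(crawl_scores) / len(crawl_scores)


def global_score(graph_scores, baseline_scores):
    # Relative score per graph (assumed: ratio to the baseline), then
    # geometric average of the relative scores across graphs.
    relative = [s / b for s, b in zip(graph_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in relative) / len(relative))
```

The geometric average suits ratios: a graph where an estimator doubles the baseline and one where it halves it cancel out to a relative score of 1.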

slide-43
SLIDE 43

Datasets and code are online

http://netiru.fr/research/14fc

slide-44
SLIDE 44

To measure the running times

  • Same crawl sequence: the oracle
  • Storage in RAM (20 GB)
  • 3.6 GHz

slide-45
SLIDE 45

The running times (ms)

slide-46
SLIDE 46

nr

Quadratic complexity, with large constant factors

slide-47
SLIDE 47

Their precision

slide-48
SLIDE 48

The precision

  • Same crawl sequence: the oracle
  • Precision: distance of the top node to the actual top node
  • Arithmetically averaged over a window of 1000 steps

slide-49
SLIDE 49

For bretagne

slide-50
SLIDE 50

Their ability to lead crawls

slide-51
SLIDE 51

Leading the crawl

Different crawl sequences: defined by the top estimated nodes

slide-52
SLIDE 52

Average graph scores for France

slide-53
SLIDE 53

The multi-armed bandits

slide-54
SLIDE 54

All the estimators

slide-55
SLIDE 55

Conclusion

slide-56
SLIDE 56

What we learnt

  • Generic model
  • NP-hardness offline
  • Refresh rate of 1
  • Greedy
  • Neighborhood features
  • Linear regressions
  • Multi-armed bandit strategy

slide-57
SLIDE 57

Future work

  • Approximation of the optimal score
  • Distributed crawl
  • Recrawling nodes
  • Further multi-armed bandits comparisons

slide-58
SLIDE 58

Thank you.

georges@netiru.fr

slide-59
SLIDE 59

Finding the optimal crawl sequences in a known graph

slide-60
SLIDE 60

  • PTime many-one reduction from the LST-Graph problem
  • The problem remains hard if nodes, not edges, are weighted
  • A subtree rooted at r is seen as a crawl sequence starting from r
  • Free edges are added to the graph to allow free crawls from the seed to any potential root of a subtree

slide-61
SLIDE 61

Rich friends will make you richer

slide-62
SLIDE 62

The greedy strategy

Node picked = argmax(β(v)), v in frontier

slide-63
SLIDE 63

Is not always optimal

[Figure: an example graph on which the greedy strategy is not optimal]

slide-64
SLIDE 64

The altered greedy strategy

Node picked =

  • probability q: argmax(β(v))
  • probability 1-q: random v such that max(β(u)) - β(v) <= ζ × max(β(u))
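A minimal sketch of the altered greedy pick, assuming the true scores β of the frontier nodes are known (the perfect-estimator setting) and are non-negative. The function name and the injectable random generator are hypothetical.

```python
import random


def altered_greedy_pick(frontier, beta, q, zeta, rng=random):
    nodes = sorted(frontier)  # fixed order, for reproducible tie-breaking
    best = max(beta[v] for v in nodes)
    # With probability q: plain greedy, the node with the highest score.
    if rng.random() < q:
        return max(nodes, key=lambda v: beta[v])
    # Otherwise: a random node whose score is within zeta * best of the best.
    close = [v for v in nodes if best - beta[v] <= zeta * best]
    return rng.choice(close)
```

With q = 1 this reduces to the greedy strategy; with ζ = 0 the random branch can only return nodes tied for the best score.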

slide-65
SLIDE 65

Altered greedy vs greedy for jazz

slide-66
SLIDE 66

The refresh rate disadvantage

slide-67
SLIDE 67

When estimation takes too long

slide-68
SLIDE 68

The score degradation (%) at different steps