Start a count for an itemset S ⊆ B if every proper subset of S had a … - PowerPoint PPT Presentation



SLIDE 1

• Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to arrival of basket B
  • Intuitively: if all subsets of S are being counted, they are “frequent/hot” and thus S has the potential to be “hot”
• Example:
  • Start counting S = {i, j} iff both i and j were counted prior to seeing B
  • Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
SLIDE 2

• Counts for single items < (2/c)∙(avg. number of items in a basket)
• Counts for larger itemsets = ??
• But we are conservative about starting counts of large sets
  • If we counted every set we saw, one basket of 20 items would initiate about 1M (2^20) counts

SLIDE 3

Mining of Massive Datasets

http://www.mmds.org

Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 4

High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction

Graph data: PageRank, SimRank, Community Detection, Spam Detection

Infinite data: Filtering data streams, Web advertising, Queries on streams

Machine learning: SVM, Decision Trees, Perceptron, kNN

Apps: Recommender systems, Association Rules, Duplicate document detection

SLIDE 5

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

SLIDE 6

Connections between political blogs

Polarization of the network [Adamic-Glance, 2005]

SLIDE 7

Citation networks and Maps of science

[Börner et al., 2012]

SLIDE 8

[Figure: the Internet as a graph: domain1, domain2, domain3 connected through routers]

Internet


SLIDE 9

Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.

SLIDE 10

 Web as a directed graph:

  • Nodes: Webpages
  • Edges: Hyperlinks

[Figure: example web pages as nodes (“I teach a class on Networks. CS224W”, “Classes are in the Gates building”, “Computer Science Department at Stanford”) with hyperlinks as directed edges]


SLIDE 12

SLIDE 13

• How to organize the Web?
• First try: Human-curated Web directories
  • Yahoo, DMOZ, LookSmart
• Second try: Web Search
  • Information Retrieval investigates: find relevant docs in a small and trusted set
  • Newspaper articles, Patents, etc.
  • But: the Web is huge, full of untrusted documents, random things, web spam, etc.


SLIDE 14

2 challenges of web search:

• (1) Web contains many sources of information. Who to “trust”?
  • Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to the query “newspaper”?
  • No single right answer
  • Trick: Pages that actually know about newspapers might all be pointing to many newspapers


SLIDE 15

• All web pages are not equally “important”
  • www.joe-schmoe.com vs. www.stanford.edu
• There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!


SLIDE 16

• We will cover the following Link Analysis approaches for computing importance of nodes in a graph:
  • PageRank
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

SLIDE 17
SLIDE 18

• Idea: Links as votes
  • Page is more important if it has more links
  • In-coming links? Out-going links?
• Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  • Links from important pages count more
  • Recursive question!

SLIDE 19

[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf nodes with score 1.6 each]

SLIDE 20

• Each link’s vote is proportional to the importance of its source page
• If page j with importance r_j has n out-links, each link gets r_j / n votes
• Page j’s own importance is the sum of the votes on its in-links


[Figure: page j has 3 out-links, each carrying r_j/3 votes; with in-links from i (3 out-links) and k (4 out-links): r_j = r_i/3 + r_k/4]

SLIDE 21

• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” r_j for page j


   r_j = Σ_{i→j} r_i / d_i        (d_i … out-degree of node i)

[Figure: “The web in 1839”: pages y, a, m; y links to itself and to a; a links to y and m; m links to a]

“Flow” equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2
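The flow equations above can be checked numerically. A minimal sketch in plain Python (the node order y, a, m and the link structure are from the slides; the encoding as a nested list is mine):

```python
# Column-stochastic link matrix M for the 3-node web: y -> {y, a}, a -> {y, m}, m -> {a}.
# M[j][i] = 1/d_i if i links to j, else 0, so each column sums to 1.
M = [
    [0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
    [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
    [0.0, 0.5, 0.0],   # r_m = r_a/2
]

# Every column sums to 1, i.e. M is column stochastic.
for i in range(3):
    assert abs(sum(M[j][i] for j in range(3)) - 1.0) < 1e-12

# The unique solution with r_y + r_a + r_m = 1 is r = (2/5, 2/5, 1/5): check M.r = r.
r = [0.4, 0.4, 0.2]
Mr = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
print(Mr)  # [0.4, 0.4, 0.2]
```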

SLIDE 22

• 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo the scale factor
• Additional constraint forces uniqueness:
  • r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
• Gaussian elimination works for small examples, but we need a better method for large web-size graphs
• We need a new formulation!


Flow equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2

SLIDE 23

• Stochastic adjacency matrix M
  • Let page i have d_i out-links
  • If i → j, then M_ji = 1/d_i, else M_ji = 0
  • M is a column stochastic matrix: columns sum to 1
• Rank vector r: a vector with an entry per page
  • r_i is the importance score of page i
  • Σ_i r_i = 1
• The flow equations can be written r = M ⋅ r


SLIDE 24

• Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
• Flow equation in the matrix form: M ⋅ r = r
  • Suppose page i links to 3 pages, including j

[Figure: M ⋅ r = r; page i has 3 out-links, so column i of M has the entry 1/3 in row j, and r_j receives the contribution r_i/3]

SLIDE 25

• The flow equations can be written r = M ⋅ r
• So the rank vector r is an eigenvector of the stochastic web matrix M
  • In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  • Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)
  • We know r is unit length and each column of M sums to one, so ‖M ⋅ r‖₁ ≤ 1
• We can now efficiently solve for r! The method is called Power iteration


NOTE: x is an eigenvector with the corresponding eigenvalue λ if: A ⋅ x = λ ⋅ x

SLIDE 26

r = M∙r

   [r_y]   [½ ½ 0] [r_y]
   [r_a] = [½ 0 1] [r_a]
   [r_m]   [0 ½ 0] [r_m]

which is exactly the flow equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2

SLIDE 27

• Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
  • Initialize: r(0) = [1/N, …, 1/N]^T
  • Iterate: r(t+1) = M ∙ r(t)
  • Stop when |r(t+1) – r(t)|₁ < ε


   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i        (d_i … out-degree of node i)

|x|₁ = Σ_{1≤i≤N} |x_i| is the L1 norm. Can use any other vector norm, e.g., Euclidean
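The iterate-until-small-L1-change scheme above is a few lines of plain Python. A minimal sketch (the helper name `power_iterate` is mine; the test matrix is the y/a/m example from the earlier slides):

```python
def power_iterate(M, eps=1e-10):
    """Power iteration: r(t+1) = M . r(t); stop when |r(t+1) - r(t)|_1 < eps."""
    N = len(M)
    r = [1.0 / N] * N                     # r(0) = [1/N, ..., 1/N]^T
    while True:
        r_next = [sum(M[j][i] * r[i] for i in range(N)) for j in range(N)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:  # L1 change
            return r_next
        r = r_next

# Column-stochastic matrix for y -> {y, a}, a -> {y, m}, m -> {a}
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)
print([round(x, 3) for x in r])  # [0.4, 0.4, 0.2]
```

This recovers the solution r_y = r_a = 2/5, r_m = 1/5 obtained from the flow equations by hand.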

SLIDE 28

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • Goto 1
• Example (y links to y and a; a links to y and m; m links to a):

          Iteration 0, 1, 2, …
   r_y:  1/3   1/3   5/12    9/24  …  6/15
   r_a:  1/3   3/6   1/3    11/24  …  6/15
   r_m:  1/3   1/6   3/12    1/6   …  3/15

   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2


SLIDE 30

• Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
  • r^(1) = M ⋅ r^(0)
  • r^(2) = M ⋅ r^(1) = M (M r^(0)) = M² r^(0)
  • r^(3) = M ⋅ r^(2) = M (M² r^(0)) = M³ r^(0)
• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M


Details!

SLIDE 31

• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M
• Proof:
  • Assume M has n linearly independent eigenvectors x₁, x₂, …, xₙ with corresponding eigenvalues λ₁, λ₂, …, λₙ, where λ₁ > λ₂ > ⋯ > λₙ
  • Vectors x₁, x₂, …, xₙ form a basis and thus we can write: r^(0) = c₁x₁ + c₂x₂ + ⋯ + cₙxₙ
  • M r^(0) = M(c₁x₁ + c₂x₂ + ⋯ + cₙxₙ)
    = c₁(M x₁) + c₂(M x₂) + ⋯ + cₙ(M xₙ)
    = c₁(λ₁ x₁) + c₂(λ₂ x₂) + ⋯ + cₙ(λₙ xₙ)
  • Repeated multiplication on both sides produces
    M^k r^(0) = c₁(λ₁^k x₁) + c₂(λ₂^k x₂) + ⋯ + cₙ(λₙ^k xₙ)


Details!

SLIDE 32

• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M
• Proof (continued):
  • Repeated multiplication on both sides produces
    M^k r^(0) = c₁(λ₁^k x₁) + c₂(λ₂^k x₂) + ⋯ + cₙ(λₙ^k xₙ)
  • M^k r^(0) = λ₁^k [c₁x₁ + c₂(λ₂/λ₁)^k x₂ + ⋯ + cₙ(λₙ/λ₁)^k xₙ]
  • Since λ₁ > λ₂, the fractions λ₂/λ₁, λ₃/λ₁, … < 1
    and so (λᵢ/λ₁)^k → 0 as k → ∞ (for all i = 2…n)
  • Thus: M^k r^(0) ≈ c₁(λ₁^k x₁)
  • Note: if c₁ = 0 then the method won’t converge
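The (λ₂/λ₁)^k term in the proof also predicts the speed of convergence: the error shrinks by the factor λ₂/λ₁ per step. A small numeric illustration (the 2×2 matrix is my own toy example, chosen so that λ₁ = 1 and λ₂ = 0.7):

```python
# Column-stochastic M with eigenvalues 1 and 0.7 (trace = 1.7).
# Its dominant eigenvector, normalized to sum 1, is (2/3, 1/3).
M = [[0.9, 0.2],
     [0.1, 0.8]]
x_star = (2 / 3, 1 / 3)

r = [0.5, 0.5]
errs = []
for _ in range(20):
    r = [M[0][0] * r[0] + M[0][1] * r[1],
         M[1][0] * r[0] + M[1][1] * r[1]]
    errs.append(abs(r[0] - x_star[0]) + abs(r[1] - x_star[1]))

# Successive L1 errors shrink by lambda2/lambda1 = 0.7 at every step
ratios = [errs[k + 1] / errs[k] for k in range(5)]
print([round(t, 3) for t in ratios])  # [0.7, 0.7, 0.7, 0.7, 0.7]
```

Here the initial error happens to lie exactly along the second eigenvector, so the ratio is 0.7 at every step; in general it approaches λ₂/λ₁ asymptotically.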

Details!

SLIDE 33

• Imagine a random web surfer:
  • At any time t, the surfer is on some page i
  • At time t+1, the surfer follows an out-link from i uniformly at random
  • Ends up on some page j linked from i
  • Process repeats indefinitely
• Let:
  • p(t) … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  • So, p(t) is a probability distribution over pages

SLIDE 34

• Where is the surfer at time t+1?
  • Follows a link uniformly at random: p(t+1) = M ⋅ p(t)
• Suppose the random walk reaches a state p(t+1) = M ⋅ p(t) = p(t),
  then p(t) is a stationary distribution of the random walk
• Our original rank vector r satisfies r = M ⋅ r
  • So, r is a stationary distribution for the random walk

SLIDE 35

 A central result from the theory of random

walks (a.k.a. Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0

SLIDE 36
SLIDE 37

• Does this converge?
• Does it converge to what we want?
• Are results reasonable?

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i      or equivalently      r = M ⋅ r

SLIDE 38

• Example:

[Figure: two-node graph with pages a and b; iteration table for r_a and r_b over iterations 0, 1, 2, …]

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

SLIDE 39

• Example:

[Figure: two-node graph with pages a and b; iteration table for r_a and r_b over iterations 0, 1, 2, …]

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

SLIDE 40

2 problems:

• (1) Some pages are dead ends (have no out-links)
  • Random walk has “nowhere” to go to
  • Such pages cause importance to “leak out”
• (2) Spider traps (all out-links are within the group)
  • Random walker gets “stuck” in a trap
  • And eventually spider traps absorb all importance

[Figure: example graph with a dead-end node]

SLIDE 41

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • And iterate
• Example (y links to y and a; a links to y and m; m links only to itself):

          Iteration 0, 1, 2, …
   r_y:  1/3   2/6   3/12    5/24  …  0
   r_a:  1/3   1/6   2/12    3/24  …  0
   r_m:  1/3   3/6   7/12   16/24  …  1

   r_y = r_y /2 + r_a /2
   r_a = r_y /2
   r_m = r_a /2 + r_m

m is a spider trap

All the PageRank score gets “trapped” in node m.
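The trap behaviour is easy to reproduce numerically; a sketch in plain Python of the y/a/m spider-trap graph above (iteration count chosen arbitrarily, well past convergence):

```python
# y -> {y, a}, a -> {y, m}, m -> {m}: m is a spider trap (self-loop only).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]

print([round(x, 4) for x in r])  # [0.0, 0.0, 1.0] -- the trap absorbs everything
```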

SLIDE 42

• The Google solution for spider traps: At each time step, the random surfer has two options
  • With prob. β, follow a link at random
  • With prob. 1−β, jump to some random page
  • Common values for β are in the range 0.8 to 0.9
• Surfer will teleport out of a spider trap within a few time steps


SLIDE 43

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • And iterate
• Example (y links to y and a; a links to y and m; m has no out-links):

          Iteration 0, 1, 2, …
   r_y:  1/3   2/6   3/12   5/24  …  0
   r_a:  1/3   1/6   2/12   3/24  …  0
   r_m:  1/3   1/6   1/12   2/24  …  0

   r_y = r_y /2 + r_a /2
   r_a = r_y /2
   r_m = r_a /2

Here the PageRank “leaks” out since the matrix is not stochastic.
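The leak can be seen directly: with a dead end, column m of M is all zeros, so total mass shrinks every step. A sketch of the dead-end graph above in plain Python:

```python
# y -> {y, a}, a -> {y, m}, m has no out-links: column m is all zeros,
# so M is not column stochastic and mass is lost on every multiplication.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]

print(round(sum(r), 6))  # 0.0 -- all importance has leaked out
```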

SLIDE 44

• Teleports: Follow random teleport links with probability 1.0 from dead-ends
  • Adjust matrix accordingly

     y   a   m             y   a   m
  y  ½   ½   0          y  ½   ½   ⅓
  a  ½   0   0    →     a  ½   0   ⅓
  m  0   ½   0          m  0   ½   ⅓

SLIDE 45

Why are dead-ends and spider traps a problem, and why do teleports solve the problem?

• Spider traps are not a problem per se, but with traps PageRank scores are not what we want
  • Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
• Dead-ends are a problem
  • The matrix is not column stochastic, so our initial assumptions are not met
  • Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go


SLIDE 46

• Google’s solution that does it all: At each step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1−β, jump to some random page
• PageRank equation [Brin-Page, ’98]:

   r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N        (d_i … out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

SLIDE 47

• PageRank equation [Brin-Page, ‘98]:

   r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N

• The Google Matrix A:

   A = β M + (1 − β) [1/N]_{N×N}

   ([1/N]_{N×N} … N-by-N matrix where all entries are 1/N)

• We have a recursive problem: r = A ⋅ r, and the Power method still works!
• What is β?
  • In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)

SLIDE 48

A = 0.8·M + 0.2·[1/N]_{N×N}:

          y    a    m                               y      a      m
    y [ 1/2  1/2   0 ]                        y [ 7/15   7/15   1/15 ]
0.8·a [ 1/2   0    0 ] + 0.2·[1/3]_{3×3}  =   a [ 7/15   1/15   1/15 ]
    m [  0   1/2   1 ]                        m [ 1/15   7/15  13/15 ]

Power iteration r = A·r:

   r_y:  1/3   0.33   0.24   0.26   …   7/33
   r_a:  1/3   0.20   0.20   0.18   …   5/33
   r_m:  1/3   0.46   0.52   0.56   …  21/33
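The Google-matrix construction and its limit can be verified in a few lines of plain Python (the graph is the y/a/m spider-trap example from the slides):

```python
beta, N = 0.8, 3
# M for y -> {y, a}, a -> {y, m}, m -> {m} (m is a spider trap)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

# Google matrix: A = beta*M + (1-beta)*[1/N]
A = [[beta * M[j][i] + (1 - beta) / N for i in range(N)] for j in range(N)]
# e.g. A[0][0] = 0.8*0.5 + 0.2/3 = 7/15, A[2][2] = 0.8*1 + 0.2/3 = 13/15

r = [1 / 3] * N
for _ in range(100):
    r = [sum(A[j][i] * r[i] for i in range(N)) for j in range(N)]
print([round(x * 33, 2) for x in r])  # [7.0, 5.0, 21.0], i.e. r = [7/33, 5/33, 21/33]
```

Note that the teleport keeps m from absorbing everything: it ends with 21/33 of the rank, not all of it.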

SLIDE 49
SLIDE 50

• Key step is matrix-vector multiplication
  • r_new = A ∙ r_old
• Easy if we have enough main memory to hold A, r_old, r_new
• Say N = 1 billion pages
  • We need 4 bytes for each entry (say)
  • 2 billion entries for the two vectors, approx 8GB
  • Matrix A has N² entries
  • 10¹⁸ is a large number!

SLIDE 51

• Suppose there are N pages
• Consider page i, with d_i out-links
• We have M_ji = 1/|d_i| when i → j, and M_ji = 0 otherwise
• The random teleport is equivalent to:
  • Adding a teleport link from i to every other page and setting transition probability to (1−β)/N
  • Reducing the probability of following each out-link from 1/|d_i| to β/|d_i|
  • Equivalent: Tax each page a fraction (1−β) of its score and redistribute it evenly

SLIDE 52

• r = A ⋅ r, where A_ji = β M_ji + (1 − β)/N

   r_j = Σ_{i=1..N} A_ji r_i
       = Σ_{i=1..N} [β M_ji + (1 − β)/N] r_i
       = Σ_{i=1..N} β M_ji r_i + (1 − β)/N Σ_{i=1..N} r_i
       = Σ_{i=1..N} β M_ji r_i + (1 − β)/N,   since Σ_i r_i = 1

• So we get: r = β M ⋅ r + [(1 − β)/N]_N


[x]_N … a vector of length N with all entries x

Note: Here we assumed M has no dead-ends

SLIDE 53

• We just rearranged the PageRank equation

   r = β M ⋅ r + [(1 − β)/N]_N

  • where [(1−β)/N]_N is a vector with all N entries (1−β)/N
• M is a sparse matrix! (with no dead-ends)
  • 10 links per node, approx 10N entries
• So in each iteration, we need to:
  • Compute r_new = β M ∙ r_old
  • Add a constant value (1−β)/N to each entry in r_new
  • Note: if M contains dead-ends then Σ_j r_new_j < 1 and we also have to renormalize r_new so that it sums to 1


SLIDE 54

• Input: Graph G and parameter β
  • Directed graph G (can have spider traps and dead ends)
  • Parameter β
• Output: PageRank vector r
  • Set: r_j^(0) = 1/N, t = 1
  • Repeat until convergence (Σ_j |r_j^(t) − r_j^(t−1)| > ε):
    ∀j: r′_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i
        r′_j^(t) = 0 if in-degree of j is 0
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r′_j^(t) + (1 − S)/N,   where S = Σ_j r′_j^(t)
    t = t + 1

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends the amount of leaked PageRank may be larger: we have to explicitly account for it by computing S.
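The complete algorithm above translates almost line-for-line into plain Python. A sketch (the dict-based graph encoding and the helper name `pagerank` are mine):

```python
def pagerank(links, beta=0.8, eps=1e-10):
    """PageRank with teleports; handles dead ends by re-inserting leaked mass.

    links: dict node -> list of out-neighbours (empty list = dead end).
    """
    nodes = list(links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        # r'_j = sum_{i->j} beta * r_i / d_i   (stays 0 if nothing points to j)
        r_new = {v: 0.0 for v in nodes}
        for i, outs in links.items():
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        # Re-insert the leaked PageRank: S = sum_j r'_j, spread (1-S)/N evenly
        S = sum(r_new.values())
        r_new = {v: x + (1.0 - S) / N for v, x in r_new.items()}
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

# y -> {y, a}, a -> {y, m}, m is a dead end
r = pagerank({'y': ['y', 'a'], 'a': ['y', 'm'], 'm': []})
print(round(sum(r.values()), 6))  # 1.0 -- no mass is lost despite the dead end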

SLIDE 55

• Encode sparse matrix using only nonzero entries
  • Space proportional roughly to number of links
  • Say 10N, or 4·10·1 billion = 40GB
  • Still won’t fit in memory, but will fit on disk

   source node   degree   destination nodes
   0             3        1, 5, 7
   1             5        17, 64, 113, 117, 245
   2             2        13, 23

SLIDE 56

• Assume enough RAM to fit r_new into memory
  • Store r_old and matrix M on disk
• 1 step of power-iteration is:

   Initialize all entries of r_new = (1−β)/N
   For each page i (of out-degree d_i):
     Read into memory: i, d_i, dest_1, …, dest_{d_i}, r_old(i)
     For j = 1…d_i:
       r_new(dest_j) += β ⋅ r_old(i) / d_i

[Figure: r_new held in memory; on disk, r_old and M encoded as (source, degree, destinations) records]
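The one-step pseudocode above runs as-is in plain Python once β and the on-disk records are filled in. A toy sketch (the 4-node graph and the list simulating the disk file are illustrative, not from the slides):

```python
beta, N = 0.8, 4
# Simulated on-disk encoding of M: one (page id, out-degree, destinations)
# record per source page, streamed one record at a time.
disk_M = [(0, 2, [1, 2]), (1, 1, [0]), (2, 2, [0, 3]), (3, 1, [2])]
r_old = [1.0 / N] * N  # would also live on disk; only r_old(i) is needed per record

# One power-iteration step, streaming M record by record:
r_new = [(1 - beta) / N] * N            # initialize every entry to (1-beta)/N
for i, d_i, dests in disk_M:            # "read i, d_i, dest_1..dest_d_i, r_old(i)"
    for j in dests:
        r_new[j] += beta * r_old[i] / d_i

print(round(sum(r_new), 6))  # 1.0 (no dead ends here, so no renormalization needed)
```

Only r_new must fit in RAM: each record of M and each entry of r_old is touched once per iteration, which is what makes the disk-based layout work.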

SLIDE 57

• Assume enough RAM to fit r_new into memory
  • Store r_old and matrix M on disk
• In each iteration, we have to:
  • Read r_old and M
  • Write r_new back to disk
  • Cost per iteration of Power method: = 2|r| + |M|
• Question:
  • What if we could not even fit r_new in memory?

SLIDE 58
  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block

[Figure: r_new split into blocks; M stored as (source, degree, destinations) records; r_old scanned in full for each block]

SLIDE 59

• Similar to nested-loop join in databases
  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block
• Total cost:
  • k scans of M and r_old
  • Cost per iteration of Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
  • Hint: M is much bigger than r (approx 10–20x), so we must avoid reading it k times per iteration


SLIDE 60

Break M into stripes! Each stripe contains only destination nodes in the corresponding block of rnew

SLIDE 61

• Break M into stripes
  • Each stripe contains only destination nodes in the corresponding block of r_new
• Some additional overhead per stripe
  • But it is usually worth it
• Cost per iteration of Power method: = |M|(1+ε) + (k+1)|r|


SLIDE 62

• Measures generic popularity of a page
  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  • Other models of importance
  • Solution: Hubs-and-Authorities
• Susceptible to link spam
  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank

SLIDE 63

Mining of Massive Datasets

http://www.mmds.org

Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 64

[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf nodes with score 1.6 each]

SLIDE 65

A = 0.8·M + 0.2·[1/N]_{N×N}:

        y      a      m
  y [ 7/15   7/15   1/15 ]      e.g. A_mm = 0.8·1 + 0.2·⅓ = 13/15,
  a [ 7/15   1/15   1/15 ]           A_ya = 0.8·½ + 0.2·⅓ = 7/15
  m [ 1/15   7/15  13/15 ]

Power iteration r = A·r:

   r_y:  1/3   0.33   0.24   0.26   …   7/33
   r_a:  1/3   0.20   0.20   0.18   …   5/33
   r_m:  1/3   0.46   0.52   0.56   …  21/33

r = A·r    Equivalently: r = β M·r + [(1−β)/N]_N

SLIDE 66

• Input: Graph G and parameter β
  • Directed graph G with spider traps and dead ends
  • Parameter β
• Output: PageRank vector r
  • Set: r_j^(0) = 1/N, t = 1
  • do:
    ∀j: r′_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i
        r′_j^(t) = 0 if in-degree of j is 0
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r′_j^(t) + (1 − S)/N,   where S = Σ_j r′_j^(t)
    t = t + 1
  • while Σ_j |r_j^(t) − r_j^(t−1)| > ε

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends the amount of leaked PageRank may be larger: we have to explicitly account for it by computing S.

SLIDE 67

• Measures generic popularity of a page
  • Will ignore/miss topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  • Other models of importance
  • Solution: Hubs-and-Authorities
• Susceptible to link spam
  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank

SLIDE 68
SLIDE 69

• Instead of generic popularity, can we measure popularity within a topic?
• Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
• Allows search queries to be answered based on interests of the user
  • Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history, or computer security

SLIDE 70

• Random walker has a small probability of teleporting at any step
• Teleport can go to:
  • Standard PageRank: Any page with equal probability
  • (To avoid dead-end and spider-trap problems)
  • Topic-Specific PageRank: A topic-specific set of “relevant” pages (teleport set)
• Idea: Bias the random walk
  • When the walker teleports, she picks a page from a set S
  • S contains only pages that are relevant to the topic
  • E.g., Open Directory (DMOZ) pages for a given topic/query
  • For each teleport set S, we get a different vector r_S

SLIDE 71

• To make this work all we need is to update the teleportation part of the PageRank formulation:

   A_ij = β M_ij + (1 − β)/|S|   if i ∈ S
   A_ij = β M_ij                 otherwise

  • A is stochastic!
• We weighted all pages in the teleport set S equally
  • Could also assign different weights to pages!
• Compute as for regular PageRank:
  • Multiply by M, then add a vector
  • Maintains sparseness
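The sparse formulation above ("multiply by M, then add a vector") is a small change to plain PageRank: the leaked mass is handed to S only, instead of to all nodes. A sketch in plain Python (the helper name `topic_pagerank` is mine; the 4-node graph below is chosen so that with S = {1} and β = 0.8 it reproduces the iteration table on the next slide):

```python
def topic_pagerank(links, S, beta=0.8, eps=1e-10):
    """Topic-Specific PageRank: teleport mass lands only in the teleport set S."""
    nodes = list(links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        r_new = {v: 0.0 for v in nodes}
        for i, outs in links.items():
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        leaked = 1.0 - sum(r_new.values())   # (1-beta), plus any dead-end loss
        for v in S:
            r_new[v] += leaked / len(S)      # teleport mass split over S only
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

# 4-node graph: 1 -> {2, 3}, 2 -> {1}, 3 -> {4}, 4 -> {3}; teleport set S = {1}
r = topic_pagerank({1: [2, 3], 2: [1], 3: [4], 4: [3]}, S=[1])
print({v: round(x, 3) for v, x in sorted(r.items())})
# {1: 0.294, 2: 0.118, 3: 0.327, 4: 0.261}
```

Note that node 1 does not end up ranked highest even though all teleports go there: nodes 3 and 4 recycle rank between themselves, which is exactly the topic bias at work.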
SLIDE 72

[Figure: 4-node graph on nodes 1–4; teleport set S = {1}, β = 0.8]

   Node   Iteration 0     1      2     …   stable
   1      0.25          0.4    0.28   …   0.294
   2      0.25          0.1    0.16   …   0.118
   3      0.25          0.3    0.32   …   0.327
   4      0.25          0.2    0.24   …   0.261

   S={1,2,3,4}, β=0.8:  r = [0.13, 0.10, 0.39, 0.36]
   S={1,2,3},   β=0.8:  r = [0.17, 0.13, 0.38, 0.30]
   S={1,2},     β=0.8:  r = [0.26, 0.20, 0.29, 0.23]
   S={1},       β=0.8:  r = [0.29, 0.11, 0.32, 0.26]

   S={1}, β=0.90:  r = [0.17, 0.07, 0.40, 0.36]
   S={1}, β=0.80:  r = [0.29, 0.11, 0.32, 0.26]
   S={1}, β=0.70:  r = [0.39, 0.14, 0.27, 0.19]
SLIDE 73

• Create different PageRanks for different topics
  • The 16 DMOZ top-level categories:
  • arts, business, sports, …
• Which topic ranking to use?
  • User can pick from a menu
  • Classify query into a topic
  • Can use the context of the query
  • E.g., query is launched from a web page talking about a known topic
  • History of queries, e.g., “basketball” followed by “Jordan”
  • User context, e.g., user’s bookmarks, …

SLIDE 74

Random Walk with Restarts: S is a single element

SLIDE 75

[Figure: graph with nodes A, B, D, E, F, G, H, I, J; all edge weights 1]

a.k.a.: Relevance, Closeness, ‘Similarity’…

[Tong-Faloutsos, ‘06]


SLIDE 76

• Shortest path is not good:
  • No effect of degree-1 nodes (E, F, G)!
  • Multi-faceted relationships


SLIDE 77

• Network flow is not good:
  • Does not punish long paths


SLIDE 78

[Figure: the same graph, nodes A, B, D, E, F, G, H, I, J with unit edge weights]

  • Multiple connections
  • Quality of connection
  • Direct & indirect connections
  • Length, degree, weight, …

[Tong-Faloutsos, ‘06]


SLIDE 79

• SimRank: Random walks from a fixed node on k-partite graphs
• Setting: k-partite graph with k types of nodes
  • E.g.: Authors, Conferences, Tags
• Topic-Specific PageRank from node u: teleport set S = {u}
• Resulting scores measure similarity to node u
• Problem:
  • Must be done once for each node u
  • Suitable for sub-Web-scale applications

[Figure: tripartite graph with Authors, Conferences, Tags]

SLIDE 80

[Figure: bipartite Conference-Author graph: conferences ICDM, KDD, SDM, IJCAI, NIPS, AAAI connected to authors Philip S. Yu, M. Jordan, Ning Zhong, R. Ramakrishnan, …]

Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}

SLIDE 81

[Figure: conferences most related to ICDM by Topic-Specific PageRank: KDD, SDM, ECML, PKDD, PAKDD, CIKM, DMKD, SIGMOD, ICML, ICDE, with scores from 0.011 down to 0.004]
SLIDE 82

• “Normal” PageRank:
  • Teleports uniformly at random to any node
  • All nodes have the same probability of the surfer landing there: S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
• Topic-Specific PageRank, also known as Personalized PageRank:
  • Teleports to a topic-specific set of pages
  • Nodes can have different probabilities of the surfer landing there: S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
• Random Walk with Restarts:
  • Topic-Specific PageRank where teleport is always to the same node: S = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


SLIDE 83
SLIDE 84

• Spamming:
  • Any deliberate action to boost a web page’s position in search engine results, incommensurate with the page’s real value
• Spam:
  • Web pages that are the result of spamming
• This is a very broad definition
  • The SEO industry might disagree!
  • SEO = search engine optimization
• Approximately 10–15% of web pages are spam

SLIDE 85

• Early search engines:
  • Crawl the Web
  • Index pages by the words they contained
  • Respond to search queries (lists of words) with the pages containing those words
• Early page ranking:
  • Attempt to order pages matching a search query by “importance”
  • First search engines considered:
  • (1) Number of times query words appeared
  • (2) Prominence of word position, e.g. title, header

SLIDE 86

 As people began to use search engines to find

things on the Web, those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not

 Example:

  • Shirt-seller might pretend to be about “movies”

 Techniques for achieving high

relevance/importance for a web page


slide-87
SLIDE 87

 How do you make your page appear to be

about movies?

  • (1) Add the word movie 1,000 times to your page
  • Set text color to the background color, so only

search engines would see it

  • (2) Or, run the query “movie” on your

target search engine

  • See what page came first in the listings
  • Copy it into your page, make it “invisible”

 These and similar techniques are term spam


slide-88
SLIDE 88

 Believe what people say about you, rather

than what you say about yourself

  • Use words in the anchor text (words that appear

underlined to represent the link) and its surrounding text

 PageRank as a tool to measure the

“importance” of Web pages


slide-89
SLIDE 89

 Our hypothetical shirt-seller loses

  • Saying he is about movies doesn’t help, because others don’t say he is about movies
  • His page isn’t very important, so it won’t be ranked

high for shirts or movies

 Example:

  • Shirt-seller creates 1,000 pages, each links to his with

“movie” in the anchor text

  • These pages have no links in, so they get little PageRank
  • So the shirt-seller can’t beat truly important movie

pages, like IMDB


slide-90
SLIDE 90

slide-91
SLIDE 91

SPAM FARMING

slide-92
SLIDE 92

 Once Google became the dominant search

engine, spammers began to work out ways to fool Google

 Spam farms were developed to concentrate

PageRank on a single page

 Link spam:

  • Creating link structures that

boost PageRank of a particular page


slide-93
SLIDE 93

 Three kinds of web pages from a

spammer’s point of view

  • Inaccessible pages
  • Accessible pages
  • e.g., blog comments pages
  • spammer can post links to his pages
  • Owned pages
  • Completely controlled by spammer
  • May span multiple domain names

slide-94
SLIDE 94

 Spammer’s goal:

  • Maximize the PageRank of target page t

 Technique:

  • Get as many links from accessible pages as

possible to target page t

  • Construct “link farm” to get PageRank

multiplier effect

slide-95
SLIDE 95

[Diagram: link-farm structure. Target page t receives links from accessible pages and from the spammer’s M owned farm pages 1…M (“millions of farm pages”); inaccessible pages sit outside the spammer’s reach.]

One of the most common and effective organizations for a link farm

slide-96
SLIDE 96

 x: PageRank contributed by accessible pages
 y: PageRank of target page t
 Rank of each “farm” page: β·y/M + (1 − β)/N

  • y = x + β·M·[β·y/M + (1 − β)/N] + (1 − β)/N
  •   = x + β²·y + β·(1 − β)·M/N + (1 − β)/N
  • The last term (1 − β)/N is very small; ignore it

Now we solve for y:

  • y = x/(1 − β²) + c·M/N, where c = β/(1 + β)


N … # pages on the web
M … # of pages the spammer owns

slide-97
SLIDE 97

 =

  • +
  • where =
  •  For  = 0.85, 1/(1-2)= 3.6

 Multiplier effect for acquired PageRank  By making M large, we can make y as

large as we want
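The closed form can be sanity-checked by iterating the two farm equations until they converge and comparing against y = x/(1 − β²) + c·M/N. The values of N, M, and x below are arbitrary assumptions for illustration:

```python
beta = 0.85
N = 1_000_000   # total pages on the web (hypothetical)
M = 10_000      # farm pages owned by the spammer (hypothetical)
x = 0.001       # PageRank flowing in from accessible pages (assumed)

# Closed form from the slide: y = x/(1 - beta^2) + c*M/N with c = beta/(1 + beta)
c = beta / (1 + beta)
y_closed = x / (1 - beta**2) + c * M / N

# Fixed-point iteration of the same pair of equations
# (the tiny (1-beta)/N term on y is ignored, as on the slide):
#   z = beta*y/M + (1-beta)/N   -- rank of each farm page
#   y = x + beta*M*z            -- rank of the target page t
y = 0.0
for _ in range(200):
    z = beta * y / M + (1 - beta) / N
    y = x + beta * M * z

print(y, y_closed)  # the two agree
```

The iteration contracts with factor β², so it converges to exactly the closed-form value, and 1/(1 − β²) ≈ 3.6 is the multiplier the slide quotes for β = 0.85.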



slide-98
SLIDE 98
slide-99
SLIDE 99

 Combating term spam

  • Analyze text using statistical methods
  • Similar to email spam filtering
  • Also useful: Detecting approximate duplicate pages
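The “statistical methods, similar to email spam filtering” idea can be sketched as a toy Naive Bayes log-odds score over page text; the corpus, function names, and smoothing choice here are illustrative assumptions, not part of the slides:

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Count word frequencies in labeled spam/ham pages (toy corpus below is made up)."""
    spam, ham = Counter(), Counter()
    for d in spam_docs:
        spam.update(d.lower().split())
    for d in ham_docs:
        ham.update(d.lower().split())
    return spam, ham, set(spam) | set(ham)

def spam_score(page, spam, ham, vocab):
    """Naive-Bayes log-odds that the page is term spam; positive means spam-like."""
    s_total, h_total, v = sum(spam.values()), sum(ham.values()), len(vocab)
    score = 0.0
    for w in page.lower().split():
        p_s = (spam[w] + 1) / (s_total + v)   # Laplace smoothing
        p_h = (ham[w] + 1) / (h_total + v)
        score += math.log(p_s / p_h)
    return score

spam, ham, vocab = train(["movie movie movie free free"],
                         ["reviews of recent movie releases"])
s_spammy = spam_score("movie movie free free free", spam, ham, vocab)
s_normal = spam_score("reviews of recent releases", spam, ham, vocab)
```

A page stuffed with repeated bait terms scores positive (spam-like), while ordinary text scores negative; real systems use far richer features, but the principle is the same.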

 Combating link spam

  • Detection and blacklisting of structures that look like

spam farms

  • Leads to another war – hiding and detecting spam farms
  • TrustRank = topic-specific PageRank with a teleport

set of trusted pages

  • Example: .edu domains, similar domains for non-US schools

slide-100
SLIDE 100

 Basic principle: Approximate isolation

  • It is rare for a “good” page to point to a “bad”

(spam) page

 Sample a set of seed pages from the web
 Have an oracle (human) identify the good

pages and the spam pages in the seed set

  • Expensive task, so we must make seed set as

small as possible


slide-101
SLIDE 101

 Call the subset of seed pages that are

identified as good the trusted pages

 Perform a topic-sensitive PageRank with

teleport set = trusted pages

  • Propagate trust through links:
  • Each page gets a trust value between 0 and 1

 Solution 1: Use a threshold value and mark

all pages below the trust threshold as spam


slide-102
SLIDE 102

 Set trust of each trusted page to 1
 Suppose trust of page p is tp

  • Page p has a set of out-links op

 For each q ∈ op, p confers the trust β·tp/|op| to q, for 0 < β < 1

 Trust is additive

  • Trust of p is the sum of the trust conferred on p by all its in-linked pages

 Note similarity to Topic-Specific PageRank

  • Within a scaling factor, TrustRank = PageRank with

trusted pages as teleport set
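The splitting-and-attenuation rule above can be sketched directly (β, the toy graph, and the unnormalized additive update are assumptions; actual TrustRank is computed as a biased PageRank, as the slide notes):

```python
def propagate_trust(out_links, trusted, beta=0.8, iters=50):
    """TrustRank-style trust propagation sketch.

    out_links -- dict: page -> list of pages it links to
    trusted   -- seed pages whose trust is pinned at 1
    beta      -- attenuation factor, 0 < beta < 1 (assumed value)
    """
    trust = {p: (1.0 if p in trusted else 0.0) for p in out_links}
    for _ in range(iters):
        new = {p: (1.0 if p in trusted else 0.0) for p in out_links}
        for p, links in out_links.items():
            if not links:
                continue
            share = beta * trust[p] / len(links)   # trust splitting
            for q in links:
                if q in new:
                    new[q] += share                # trust is additive
        trust = new
    return trust

# Hypothetical 4-page graph: a trusted seed links to a and b; both link to c
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
t = propagate_trust(graph, {"seed"})
```

Here the seed’s trust is split across its two out-links (a and b each get β/2 = 0.4), and c accumulates attenuated trust additively from both paths (0.8·0.4 + 0.8·0.4 = 0.64).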

slide-103
SLIDE 103

 Trust attenuation:

  • The degree of trust conferred by a trusted page

decreases with the distance in the graph

 Trust splitting:

  • The larger the number of out-links from a page,

the less scrutiny the page author gives each out- link

  • Trust is split across out-links

slide-104
SLIDE 104

 Two conflicting considerations:

  • Human has to inspect each seed page, so

seed set must be as small as possible

  • Must ensure every good page gets adequate

trust rank, so we need to make all good pages reachable from the seed set by short paths
