Start a count for an itemset S ⊆ B if every proper subset of S had a … - PowerPoint PPT Presentation



SLIDE 1

• Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to arrival of basket B
  • Intuitively: if all subsets of S are being counted, they are “frequent/hot” and thus S has the potential to be “hot”
• Example:
  • Start counting S = {i, j} iff both i and j were counted prior to seeing B
  • Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
SLIDE 2

• Counts for single items < (2/c)∙(avg. number of items in a basket)
• Counts for larger itemsets = ??
• But we are conservative about starting counts of large sets
  • If we counted every set we saw, one basket of 20 items would initiate about 1M (2^20) counts

SLIDE 3

Mining of Massive Datasets

http://www.mmds.org

Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 4

High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction

Graph data: PageRank, SimRank, Community Detection, Spam Detection

Infinite data: Filtering data streams, Web advertising, Queries on streams

Machine learning: SVM, Decision Trees, Perceptron, kNN

Apps: Recommender systems, Association Rules, Duplicate document detection

SLIDE 5

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

SLIDE 6

Connections between political blogs

Polarization of the network [Adamic-Glance, 2005]

SLIDE 7

Citation networks and Maps of science

[Börner et al., 2012]

SLIDE 8

[Figure: the Internet as a graph: domain1, domain2, domain3 connected through routers]

Internet


SLIDE 9

Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.

SLIDE 10

 Web as a directed graph:

  • Nodes: Webpages
  • Edges: Hyperlinks

[Figure: example web pages as nodes (“I teach a class on Networks. CS224W”, “Classes are in the Gates building”, “Computer Science Department at Stanford”) with hyperlinks as directed edges]


SLIDE 12

SLIDE 13

• How to organize the Web?
• First try: Human-curated Web directories
  • Yahoo, DMOZ, LookSmart
• Second try: Web Search
  • Information Retrieval investigates: find relevant docs in a small and trusted set
  • Newspaper articles, Patents, etc.
  • But: the Web is huge, full of untrusted documents, random things, web spam, etc.


SLIDE 14

2 challenges of web search:

• (1) Web contains many sources of information. Who to “trust”?
  • Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to the query “newspaper”?
  • No single right answer
  • Trick: Pages that actually know about newspapers might all be pointing to many newspapers


SLIDE 15

• All web pages are not equally “important”
  • www.joe-schmoe.com vs. www.stanford.edu
• There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!


SLIDE 16

• We will cover the following Link Analysis approaches for computing importance of nodes in a graph:
  • PageRank
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

SLIDE 17
SLIDE 18

• Idea: Links as votes
  • Page is more important if it has more links
  • In-coming links? Out-going links?
• Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  • Links from important pages count more
  • Recursive question!

SLIDE 19

[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf nodes with score 1.6 each]

SLIDE 20

• Each link’s vote is proportional to the importance of its source page
• If page j with importance r_j has n out-links, each link gets r_j / n votes
• Page j’s own importance is the sum of the votes on its in-links


[Figure: page j has 3 out-links, each carrying r_j/3 votes; with in-links from i (3 out-links) and k (4 out-links): r_j = r_i/3 + r_k/4]

SLIDE 21

• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” r_j for page j


   r_j = Σ_{i→j} r_i / d_i        (d_i … out-degree of node i)

[Figure: “The web in 1839”: pages y, a, m; y links to itself and to a; a links to y and m; m links to a]

“Flow” equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2
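The flow equations above can be checked numerically. A minimal sketch in plain Python (the node order y, a, m and the link structure are from the slides; the encoding as a nested list is mine):

```python
# Column-stochastic link matrix M for the 3-node web: y -> {y, a}, a -> {y, m}, m -> {a}.
# M[j][i] = 1/d_i if i links to j, else 0, so each column sums to 1.
M = [
    [0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
    [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
    [0.0, 0.5, 0.0],   # r_m = r_a/2
]

# Every column sums to 1, i.e. M is column stochastic.
for i in range(3):
    assert abs(sum(M[j][i] for j in range(3)) - 1.0) < 1e-12

# The unique solution with r_y + r_a + r_m = 1 is r = (2/5, 2/5, 1/5): check M.r = r.
r = [0.4, 0.4, 0.2]
Mr = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
print(Mr)  # [0.4, 0.4, 0.2]
```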

SLIDE 22

• 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo the scale factor
• Additional constraint forces uniqueness:
  • r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
• Gaussian elimination works for small examples, but we need a better method for large web-size graphs
• We need a new formulation!


Flow equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2

SLIDE 23

• Stochastic adjacency matrix M
  • Let page i have d_i out-links
  • If i → j, then M_ji = 1/d_i, else M_ji = 0
  • M is a column stochastic matrix: columns sum to 1
• Rank vector r: a vector with an entry per page
  • r_i is the importance score of page i
  • Σ_i r_i = 1
• The flow equations can be written r = M ⋅ r


SLIDE 24

• Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
• Flow equation in the matrix form: M ⋅ r = r
  • Suppose page i links to 3 pages, including j

[Figure: M ⋅ r = r; page i has 3 out-links, so column i of M has the entry 1/3 in row j, and r_j receives the contribution r_i/3]

SLIDE 25

• The flow equations can be written r = M ⋅ r
• So the rank vector r is an eigenvector of the stochastic web matrix M
  • In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  • Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)
  • We know r is unit length and each column of M sums to one, so ‖M ⋅ r‖₁ ≤ 1
• We can now efficiently solve for r! The method is called Power iteration


NOTE: x is an eigenvector with the corresponding eigenvalue λ if: A ⋅ x = λ ⋅ x

SLIDE 26

r = M∙r

   [r_y]   [½ ½ 0] [r_y]
   [r_a] = [½ 0 1] [r_a]
   [r_m]   [0 ½ 0] [r_m]

which is exactly the flow equations:
   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2

SLIDE 27

• Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
  • Initialize: r(0) = [1/N, …, 1/N]^T
  • Iterate: r(t+1) = M ∙ r(t)
  • Stop when |r(t+1) – r(t)|₁ < ε


   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i        (d_i … out-degree of node i)

|x|₁ = Σ_{1≤i≤N} |x_i| is the L1 norm. Can use any other vector norm, e.g., Euclidean
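The iterate-until-small-L1-change scheme above is a few lines of plain Python. A minimal sketch (the helper name `power_iterate` is mine; the test matrix is the y/a/m example from the earlier slides):

```python
def power_iterate(M, eps=1e-10):
    """Power iteration: r(t+1) = M . r(t); stop when |r(t+1) - r(t)|_1 < eps."""
    N = len(M)
    r = [1.0 / N] * N                     # r(0) = [1/N, ..., 1/N]^T
    while True:
        r_next = [sum(M[j][i] * r[i] for i in range(N)) for j in range(N)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:  # L1 change
            return r_next
        r = r_next

# Column-stochastic matrix for y -> {y, a}, a -> {y, m}, m -> {a}
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)
print([round(x, 3) for x in r])  # [0.4, 0.4, 0.2]
```

This recovers the solution r_y = r_a = 2/5, r_m = 1/5 obtained from the flow equations by hand.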

SLIDE 28

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • Goto 1
• Example (y links to y and a; a links to y and m; m links to a):

          Iteration 0, 1, 2, …
   r_y:  1/3   1/3   5/12    9/24  …  6/15
   r_a:  1/3   3/6   1/3    11/24  …  6/15
   r_m:  1/3   1/6   3/12    1/6   …  3/15

   r_y = r_y /2 + r_a /2
   r_a = r_y /2 + r_m
   r_m = r_a /2


SLIDE 30

• Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
  • r^(1) = M ⋅ r^(0)
  • r^(2) = M ⋅ r^(1) = M (M r^(0)) = M² r^(0)
  • r^(3) = M ⋅ r^(2) = M (M² r^(0)) = M³ r^(0)
• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M


Details!

SLIDE 31

• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M
• Proof:
  • Assume M has n linearly independent eigenvectors x₁, x₂, …, xₙ with corresponding eigenvalues λ₁, λ₂, …, λₙ, where λ₁ > λ₂ > ⋯ > λₙ
  • Vectors x₁, x₂, …, xₙ form a basis and thus we can write: r^(0) = c₁x₁ + c₂x₂ + ⋯ + cₙxₙ
  • M r^(0) = M(c₁x₁ + c₂x₂ + ⋯ + cₙxₙ)
    = c₁(M x₁) + c₂(M x₂) + ⋯ + cₙ(M xₙ)
    = c₁(λ₁ x₁) + c₂(λ₂ x₂) + ⋯ + cₙ(λₙ xₙ)
  • Repeated multiplication on both sides produces
    M^k r^(0) = c₁(λ₁^k x₁) + c₂(λ₂^k x₂) + ⋯ + cₙ(λₙ^k xₙ)


Details!

SLIDE 32

• Claim: The sequence M ⋅ r^(0), M² ⋅ r^(0), … M^k ⋅ r^(0), … approaches the dominant eigenvector of M
• Proof (continued):
  • Repeated multiplication on both sides produces
    M^k r^(0) = c₁(λ₁^k x₁) + c₂(λ₂^k x₂) + ⋯ + cₙ(λₙ^k xₙ)
  • M^k r^(0) = λ₁^k [c₁x₁ + c₂(λ₂/λ₁)^k x₂ + ⋯ + cₙ(λₙ/λ₁)^k xₙ]
  • Since λ₁ > λ₂, the fractions λ₂/λ₁, λ₃/λ₁, … < 1
    and so (λᵢ/λ₁)^k → 0 as k → ∞ (for all i = 2…n)
  • Thus: M^k r^(0) ≈ c₁(λ₁^k x₁)
  • Note: if c₁ = 0 then the method won’t converge
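The (λ₂/λ₁)^k term in the proof also predicts the speed of convergence: the error shrinks by the factor λ₂/λ₁ per step. A small numeric illustration (the 2×2 matrix is my own toy example, chosen so that λ₁ = 1 and λ₂ = 0.7):

```python
# Column-stochastic M with eigenvalues 1 and 0.7 (trace = 1.7).
# Its dominant eigenvector, normalized to sum 1, is (2/3, 1/3).
M = [[0.9, 0.2],
     [0.1, 0.8]]
x_star = (2 / 3, 1 / 3)

r = [0.5, 0.5]
errs = []
for _ in range(20):
    r = [M[0][0] * r[0] + M[0][1] * r[1],
         M[1][0] * r[0] + M[1][1] * r[1]]
    errs.append(abs(r[0] - x_star[0]) + abs(r[1] - x_star[1]))

# Successive L1 errors shrink by lambda2/lambda1 = 0.7 at every step
ratios = [errs[k + 1] / errs[k] for k in range(5)]
print([round(t, 3) for t in ratios])  # [0.7, 0.7, 0.7, 0.7, 0.7]
```

Here the initial error happens to lie exactly along the second eigenvector, so the ratio is 0.7 at every step; in general it approaches λ₂/λ₁ asymptotically.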

Details!

SLIDE 33

• Imagine a random web surfer:
  • At any time t, the surfer is on some page i
  • At time t+1, the surfer follows an out-link from i uniformly at random
  • Ends up on some page j linked from i
  • Process repeats indefinitely
• Let:
  • p(t) … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  • So, p(t) is a probability distribution over pages

SLIDE 34

• Where is the surfer at time t+1?
  • Follows a link uniformly at random: p(t+1) = M ⋅ p(t)
• Suppose the random walk reaches a state p(t+1) = M ⋅ p(t) = p(t),
  then p(t) is a stationary distribution of the random walk
• Our original rank vector r satisfies r = M ⋅ r
  • So, r is a stationary distribution for the random walk

SLIDE 35

 A central result from the theory of random

walks (a.k.a. Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0

SLIDE 36
SLIDE 37

• Does this converge?
• Does it converge to what we want?
• Are results reasonable?

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i      or equivalently      r = M ⋅ r

SLIDE 38

• Example:

[Figure: two-node graph with pages a and b; iteration table for r_a and r_b over iterations 0, 1, 2, …]

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

SLIDE 39

• Example:

[Figure: two-node graph with pages a and b; iteration table for r_a and r_b over iterations 0, 1, 2, …]

   r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

SLIDE 40

2 problems:

• (1) Some pages are dead ends (have no out-links)
  • Random walk has “nowhere” to go to
  • Such pages cause importance to “leak out”
• (2) Spider traps (all out-links are within the group)
  • Random walker gets “stuck” in a trap
  • And eventually spider traps absorb all importance

[Figure: example graph with a dead-end node]

SLIDE 41

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • And iterate
• Example (y links to y and a; a links to y and m; m links only to itself):

          Iteration 0, 1, 2, …
   r_y:  1/3   2/6   3/12    5/24  …  0
   r_a:  1/3   1/6   2/12    3/24  …  0
   r_m:  1/3   3/6   7/12   16/24  …  1

   r_y = r_y /2 + r_a /2
   r_a = r_y /2
   r_m = r_a /2 + r_m

m is a spider trap

All the PageRank score gets “trapped” in node m.
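The trap behaviour is easy to reproduce numerically; a sketch in plain Python of the y/a/m spider-trap graph above (iteration count chosen arbitrarily, well past convergence):

```python
# y -> {y, a}, a -> {y, m}, m -> {m}: m is a spider trap (self-loop only).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]

print([round(x, 4) for x in r])  # [0.0, 0.0, 1.0] -- the trap absorbs everything
```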

SLIDE 42

• The Google solution for spider traps: At each time step, the random surfer has two options
  • With prob. β, follow a link at random
  • With prob. 1−β, jump to some random page
  • Common values for β are in the range 0.8 to 0.9
• Surfer will teleport out of a spider trap within a few time steps


SLIDE 43

• Power Iteration:
  • Set r_j = 1/N
  • 1: r′_j = Σ_{i→j} r_i / d_i
  • 2: r = r′
  • And iterate
• Example (y links to y and a; a links to y and m; m has no out-links):

          Iteration 0, 1, 2, …
   r_y:  1/3   2/6   3/12   5/24  …  0
   r_a:  1/3   1/6   2/12   3/24  …  0
   r_m:  1/3   1/6   1/12   2/24  …  0

   r_y = r_y /2 + r_a /2
   r_a = r_y /2
   r_m = r_a /2

Here the PageRank “leaks” out since the matrix is not stochastic.
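The leak can be seen directly: with a dead end, column m of M is all zeros, so total mass shrinks every step. A sketch of the dead-end graph above in plain Python:

```python
# y -> {y, a}, a -> {y, m}, m has no out-links: column m is all zeros,
# so M is not column stochastic and mass is lost on every multiplication.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]

print(round(sum(r), 6))  # 0.0 -- all importance has leaked out
```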

SLIDE 44

• Teleports: Follow random teleport links with probability 1.0 from dead-ends
  • Adjust matrix accordingly

     y   a   m             y   a   m
  y  ½   ½   0          y  ½   ½   ⅓
  a  ½   0   0    →     a  ½   0   ⅓
  m  0   ½   0          m  0   ½   ⅓

SLIDE 45

Why are dead-ends and spider traps a problem, and why do teleports solve the problem?

• Spider traps are not a problem per se, but with traps PageRank scores are not what we want
  • Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
• Dead-ends are a problem
  • The matrix is not column stochastic, so our initial assumptions are not met
  • Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go


SLIDE 46

• Google’s solution that does it all: At each step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1−β, jump to some random page
• PageRank equation [Brin-Page, ’98]:

   r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N        (d_i … out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.

SLIDE 47

• PageRank equation [Brin-Page, ‘98]:

   r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N

• The Google Matrix A:

   A = β M + (1 − β) [1/N]_{N×N}

   ([1/N]_{N×N} … N-by-N matrix where all entries are 1/N)

• We have a recursive problem: r = A ⋅ r, and the Power method still works!
• What is β?
  • In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)

SLIDE 48

A = 0.8·M + 0.2·[1/N]_{N×N}:

          y    a    m                               y      a      m
    y [ 1/2  1/2   0 ]                        y [ 7/15   7/15   1/15 ]
0.8·a [ 1/2   0    0 ] + 0.2·[1/3]_{3×3}  =   a [ 7/15   1/15   1/15 ]
    m [  0   1/2   1 ]                        m [ 1/15   7/15  13/15 ]

Power iteration r = A·r:

   r_y:  1/3   0.33   0.24   0.26   …   7/33
   r_a:  1/3   0.20   0.20   0.18   …   5/33
   r_m:  1/3   0.46   0.52   0.56   …  21/33
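The Google-matrix construction and its limit can be verified in a few lines of plain Python (the graph is the y/a/m spider-trap example from the slides):

```python
beta, N = 0.8, 3
# M for y -> {y, a}, a -> {y, m}, m -> {m} (m is a spider trap)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

# Google matrix: A = beta*M + (1-beta)*[1/N]
A = [[beta * M[j][i] + (1 - beta) / N for i in range(N)] for j in range(N)]
# e.g. A[0][0] = 0.8*0.5 + 0.2/3 = 7/15, A[2][2] = 0.8*1 + 0.2/3 = 13/15

r = [1 / 3] * N
for _ in range(100):
    r = [sum(A[j][i] * r[i] for i in range(N)) for j in range(N)]
print([round(x * 33, 2) for x in r])  # [7.0, 5.0, 21.0], i.e. r = [7/33, 5/33, 21/33]
```

Note that the teleport keeps m from absorbing everything: it ends with 21/33 of the rank, not all of it.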

SLIDE 49
SLIDE 50

• Key step is matrix-vector multiplication
  • r_new = A ∙ r_old
• Easy if we have enough main memory to hold A, r_old, r_new
• Say N = 1 billion pages
  • We need 4 bytes for each entry (say)
  • 2 billion entries for the two vectors, approx 8GB
  • Matrix A has N² entries
  • 10¹⁸ is a large number!

SLIDE 51

• Suppose there are N pages
• Consider page i, with d_i out-links
• We have M_ji = 1/|d_i| when i → j, and M_ji = 0 otherwise
• The random teleport is equivalent to:
  • Adding a teleport link from i to every other page and setting transition probability to (1−β)/N
  • Reducing the probability of following each out-link from 1/|d_i| to β/|d_i|
  • Equivalent: Tax each page a fraction (1−β) of its score and redistribute it evenly

SLIDE 52

• r = A ⋅ r, where A_ji = β M_ji + (1 − β)/N

   r_j = Σ_{i=1..N} A_ji r_i
       = Σ_{i=1..N} [β M_ji + (1 − β)/N] r_i
       = Σ_{i=1..N} β M_ji r_i + (1 − β)/N Σ_{i=1..N} r_i
       = Σ_{i=1..N} β M_ji r_i + (1 − β)/N,   since Σ_i r_i = 1

• So we get: r = β M ⋅ r + [(1 − β)/N]_N


[x]_N … a vector of length N with all entries x

Note: Here we assumed M has no dead-ends

SLIDE 53

• We just rearranged the PageRank equation

   r = β M ⋅ r + [(1 − β)/N]_N

  • where [(1−β)/N]_N is a vector with all N entries (1−β)/N
• M is a sparse matrix! (with no dead-ends)
  • 10 links per node, approx 10N entries
• So in each iteration, we need to:
  • Compute r_new = β M ∙ r_old
  • Add a constant value (1−β)/N to each entry in r_new
  • Note: if M contains dead-ends then Σ_j r_new_j < 1 and we also have to renormalize r_new so that it sums to 1


SLIDE 54

• Input: Graph G and parameter β
  • Directed graph G (can have spider traps and dead ends)
  • Parameter β
• Output: PageRank vector r
  • Set: r_j^(0) = 1/N, t = 1
  • Repeat until convergence (Σ_j |r_j^(t) − r_j^(t−1)| > ε):
    ∀j: r′_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i
        r′_j^(t) = 0 if in-degree of j is 0
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r′_j^(t) + (1 − S)/N,   where S = Σ_j r′_j^(t)
    t = t + 1

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends the amount of leaked PageRank may be larger: we have to explicitly account for it by computing S.
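The complete algorithm above translates almost line-for-line into plain Python. A sketch (the dict-based graph encoding and the helper name `pagerank` are mine):

```python
def pagerank(links, beta=0.8, eps=1e-10):
    """PageRank with teleports; handles dead ends by re-inserting leaked mass.

    links: dict node -> list of out-neighbours (empty list = dead end).
    """
    nodes = list(links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        # r'_j = sum_{i->j} beta * r_i / d_i   (stays 0 if nothing points to j)
        r_new = {v: 0.0 for v in nodes}
        for i, outs in links.items():
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        # Re-insert the leaked PageRank: S = sum_j r'_j, spread (1-S)/N evenly
        S = sum(r_new.values())
        r_new = {v: x + (1.0 - S) / N for v, x in r_new.items()}
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

# y -> {y, a}, a -> {y, m}, m is a dead end
r = pagerank({'y': ['y', 'a'], 'a': ['y', 'm'], 'm': []})
print(round(sum(r.values()), 6))  # 1.0 -- no mass is lost despite the dead end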

SLIDE 55

• Encode sparse matrix using only nonzero entries
  • Space proportional roughly to number of links
  • Say 10N, or 4·10·1 billion = 40GB
  • Still won’t fit in memory, but will fit on disk

   source node   degree   destination nodes
   0             3        1, 5, 7
   1             5        17, 64, 113, 117, 245
   2             2        13, 23

SLIDE 56

• Assume enough RAM to fit r_new into memory
  • Store r_old and matrix M on disk
• 1 step of power-iteration is:

   Initialize all entries of r_new = (1−β)/N
   For each page i (of out-degree d_i):
     Read into memory: i, d_i, dest_1, …, dest_{d_i}, r_old(i)
     For j = 1…d_i:
       r_new(dest_j) += β ⋅ r_old(i) / d_i

[Figure: r_new held in memory; on disk, r_old and M encoded as (source, degree, destinations) records]
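The one-step pseudocode above runs as-is in plain Python once β and the on-disk records are filled in. A toy sketch (the 4-node graph and the list simulating the disk file are illustrative, not from the slides):

```python
beta, N = 0.8, 4
# Simulated on-disk encoding of M: one (page id, out-degree, destinations)
# record per source page, streamed one record at a time.
disk_M = [(0, 2, [1, 2]), (1, 1, [0]), (2, 2, [0, 3]), (3, 1, [2])]
r_old = [1.0 / N] * N  # would also live on disk; only r_old(i) is needed per record

# One power-iteration step, streaming M record by record:
r_new = [(1 - beta) / N] * N            # initialize every entry to (1-beta)/N
for i, d_i, dests in disk_M:            # "read i, d_i, dest_1..dest_d_i, r_old(i)"
    for j in dests:
        r_new[j] += beta * r_old[i] / d_i

print(round(sum(r_new), 6))  # 1.0 (no dead ends here, so no renormalization needed)
```

Only r_new must fit in RAM: each record of M and each entry of r_old is touched once per iteration, which is what makes the disk-based layout work.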

SLIDE 57

• Assume enough RAM to fit r_new into memory
  • Store r_old and matrix M on disk
• In each iteration, we have to:
  • Read r_old and M
  • Write r_new back to disk
  • Cost per iteration of Power method: = 2|r| + |M|
• Question:
  • What if we could not even fit r_new in memory?

SLIDE 58
  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block

[Figure: r_new split into blocks; M stored as (source, degree, destinations) records; r_old scanned in full for each block]

SLIDE 59

• Similar to nested-loop join in databases
  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block
• Total cost:
  • k scans of M and r_old
  • Cost per iteration of Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
  • Hint: M is much bigger than r (approx 10–20x), so we must avoid reading it k times per iteration


SLIDE 60

Break M into stripes! Each stripe contains only destination nodes in the corresponding block of rnew

SLIDE 61

• Break M into stripes
  • Each stripe contains only destination nodes in the corresponding block of r_new
• Some additional overhead per stripe
  • But it is usually worth it
• Cost per iteration of Power method: = |M|(1+ε) + (k+1)|r|


SLIDE 62

• Measures generic popularity of a page
  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  • Other models of importance
  • Solution: Hubs-and-Authorities
• Susceptible to link spam
  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank

SLIDE 63

Mining of Massive Datasets

http://www.mmds.org

Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 64

[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf nodes with score 1.6 each]

SLIDE 65

A = 0.8·M + 0.2·[1/N]_{N×N}:

        y      a      m
  y [ 7/15   7/15   1/15 ]      e.g. A_mm = 0.8·1 + 0.2·⅓ = 13/15,
  a [ 7/15   1/15   1/15 ]           A_ya = 0.8·½ + 0.2·⅓ = 7/15
  m [ 1/15   7/15  13/15 ]

Power iteration r = A·r:

   r_y:  1/3   0.33   0.24   0.26   …   7/33
   r_a:  1/3   0.20   0.20   0.18   …   5/33
   r_m:  1/3   0.46   0.52   0.56   …  21/33

r = A·r    Equivalently: r = β M·r + [(1−β)/N]_N

SLIDE 66

• Input: Graph G and parameter β
  • Directed graph G with spider traps and dead ends
  • Parameter β
• Output: PageRank vector r
  • Set: r_j^(0) = 1/N, t = 1
  • do:
    ∀j: r′_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i
        r′_j^(t) = 0 if in-degree of j is 0
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r′_j^(t) + (1 − S)/N,   where S = Σ_j r′_j^(t)
    t = t + 1
  • while Σ_j |r_j^(t) − r_j^(t−1)| > ε

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends the amount of leaked PageRank may be larger: we have to explicitly account for it by computing S.

SLIDE 67

• Measures generic popularity of a page
  • Will ignore/miss topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  • Other models of importance
  • Solution: Hubs-and-Authorities
• Susceptible to link spam
  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank

SLIDE 68
SLIDE 69

• Instead of generic popularity, can we measure popularity within a topic?
• Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
• Allows search queries to be answered based on interests of the user
  • Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history, or computer security

SLIDE 70

• Random walker has a small probability of teleporting at any step
• Teleport can go to:
  • Standard PageRank: Any page with equal probability
  • (To avoid dead-end and spider-trap problems)
  • Topic-Specific PageRank: A topic-specific set of “relevant” pages (teleport set)
• Idea: Bias the random walk
  • When the walker teleports, she picks a page from a set S
  • S contains only pages that are relevant to the topic
  • E.g., Open Directory (DMOZ) pages for a given topic/query
  • For each teleport set S, we get a different vector r_S

SLIDE 71

• To make this work all we need is to update the teleportation part of the PageRank formulation:

   A_ij = β M_ij + (1 − β)/|S|   if i ∈ S
   A_ij = β M_ij                 otherwise

  • A is stochastic!
• We weighted all pages in the teleport set S equally
  • Could also assign different weights to pages!
• Compute as for regular PageRank:
  • Multiply by M, then add a vector
  • Maintains sparseness
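The sparse formulation above ("multiply by M, then add a vector") is a small change to plain PageRank: the leaked mass is handed to S only, instead of to all nodes. A sketch in plain Python (the helper name `topic_pagerank` is mine; the 4-node graph below is chosen so that with S = {1} and β = 0.8 it reproduces the iteration table on the next slide):

```python
def topic_pagerank(links, S, beta=0.8, eps=1e-10):
    """Topic-Specific PageRank: teleport mass lands only in the teleport set S."""
    nodes = list(links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        r_new = {v: 0.0 for v in nodes}
        for i, outs in links.items():
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        leaked = 1.0 - sum(r_new.values())   # (1-beta), plus any dead-end loss
        for v in S:
            r_new[v] += leaked / len(S)      # teleport mass split over S only
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

# 4-node graph: 1 -> {2, 3}, 2 -> {1}, 3 -> {4}, 4 -> {3}; teleport set S = {1}
r = topic_pagerank({1: [2, 3], 2: [1], 3: [4], 4: [3]}, S=[1])
print({v: round(x, 3) for v, x in sorted(r.items())})
# {1: 0.294, 2: 0.118, 3: 0.327, 4: 0.261}
```

Note that node 1 does not end up ranked highest even though all teleports go there: nodes 3 and 4 recycle rank between themselves, which is exactly the topic bias at work.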
SLIDE 72

[Figure: 4-node graph on nodes 1–4; teleport set S = {1}, β = 0.8]

   Node   Iteration 0     1      2     …   stable
   1      0.25          0.4    0.28   …   0.294
   2      0.25          0.1    0.16   …   0.118
   3      0.25          0.3    0.32   …   0.327
   4      0.25          0.2    0.24   …   0.261

   S={1,2,3,4}, β=0.8:  r = [0.13, 0.10, 0.39, 0.36]
   S={1,2,3},   β=0.8:  r = [0.17, 0.13, 0.38, 0.30]
   S={1,2},     β=0.8:  r = [0.26, 0.20, 0.29, 0.23]
   S={1},       β=0.8:  r = [0.29, 0.11, 0.32, 0.26]

   S={1}, β=0.90:  r = [0.17, 0.07, 0.40, 0.36]
   S={1}, β=0.80:  r = [0.29, 0.11, 0.32, 0.26]
   S={1}, β=0.70:  r = [0.39, 0.14, 0.27, 0.19]
SLIDE 73

• Create different PageRanks for different topics
  • The 16 DMOZ top-level categories:
  • arts, business, sports, …
• Which topic ranking to use?
  • User can pick from a menu
  • Classify query into a topic
  • Can use the context of the query
  • E.g., query is launched from a web page talking about a known topic
  • History of queries, e.g., “basketball” followed by “Jordan”
  • User context, e.g., user’s bookmarks, …

SLIDE 74

Random Walk with Restarts: S is a single element

SLIDE 75

[Figure: graph with nodes A, B, D, E, F, G, H, I, J; all edge weights 1]

a.k.a.: Relevance, Closeness, ‘Similarity’…

[Tong-Faloutsos, ‘06]


SLIDE 76

• Shortest path is not good:
  • No effect of degree-1 nodes (E, F, G)!
  • Multi-faceted relationships


SLIDE 77

• Network flow is not good:
  • Does not punish long paths


SLIDE 78

[Figure: the same graph, nodes A, B, D, E, F, G, H, I, J with unit edge weights]

  • Multiple connections
  • Quality of connection
  • Direct & indirect connections
  • Length, degree, weight, …

[Tong-Faloutsos, ‘06]


SLIDE 79

• SimRank: Random walks from a fixed node on k-partite graphs
• Setting: k-partite graph with k types of nodes
  • E.g.: Authors, Conferences, Tags
• Topic-Specific PageRank from node u: teleport set S = {u}
• Resulting scores measure similarity to node u
• Problem:
  • Must be done once for each node u
  • Suitable for sub-Web-scale applications

[Figure: tripartite graph with Authors, Conferences, Tags]

SLIDE 80

[Figure: bipartite Conference-Author graph: conferences ICDM, KDD, SDM, IJCAI, NIPS, AAAI connected to authors Philip S. Yu, M. Jordan, Ning Zhong, R. Ramakrishnan, …]

Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}

SLIDE 81

[Figure: conferences most related to ICDM by Topic-Specific PageRank: KDD, SDM, ECML, PKDD, PAKDD, CIKM, DMKD, SIGMOD, ICML, ICDE, with scores from 0.011 down to 0.004]
SLIDE 82

• “Normal” PageRank:
  • Teleports uniformly at random to any node
  • All nodes have the same probability of the surfer landing there: S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
• Topic-Specific PageRank, also known as Personalized PageRank:
  • Teleports to a topic-specific set of pages
  • Nodes can have different probabilities of the surfer landing there: S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
• Random Walk with Restarts:
  • Topic-Specific PageRank where teleport is always to the same node: S = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


SLIDE 83
SLIDE 84

• Spamming:
  • Any deliberate action to boost a web page’s position in search engine results, incommensurate with the page’s real value
• Spam:
  • Web pages that are the result of spamming
• This is a very broad definition
  • The SEO industry might disagree!
  • SEO = search engine optimization
• Approximately 10–15% of web pages are spam

SLIDE 85

• Early search engines:
  • Crawl the Web
  • Index pages by the words they contained
  • Respond to search queries (lists of words) with the pages containing those words
• Early page ranking:
  • Attempt to order pages matching a search query by “importance”
  • First search engines considered:
  • (1) Number of times query words appeared
  • (2) Prominence of word position, e.g. title, header

SLIDE 86

 As people began to use search engines to find

things on the Web, those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not

 Example:

  • Shirt-seller might pretend to be about “movies”

 Techniques for achieving high

relevance/importance for a web page


slide-87
SLIDE 87

 How do you make your page appear to be

about movies?

  • (1) Add the word movie 1,000 times to your page
  • Set text color to the background color, so only

search engines would see it

  • (2) Or, run the query “movie” on your

target search engine

  • See what page came first in the listings
  • Copy it into your page, make it “invisible”

 These and similar techniques are term spam


slide-88
SLIDE 88

 Believe what people say about you, rather

than what you say about yourself

  • Use words in the anchor text (words that appear

underlined to represent the link) and its surrounding text

 PageRank as a tool to measure the

“importance” of Web pages


slide-89
SLIDE 89

 Our hypothetical shirt-seller loses

  • Saying he is about movies doesn’t help, because others don’t say he is about movies
  • His page isn’t very important, so it won’t be ranked

high for shirts or movies

 Example:

  • Shirt-seller creates 1,000 pages, each links to his with

“movie” in the anchor text

  • These pages have no links in, so they get little PageRank
  • So the shirt-seller can’t beat truly important movie

pages, like IMDB


slide-90
SLIDE 90

slide-91
SLIDE 91

SPAM FARMING

slide-92
SLIDE 92

 Once Google became the dominant search

engine, spammers began to work out ways to fool Google

 Spam farms were developed to concentrate

PageRank on a single page

 Link spam:

  • Creating link structures that

boost PageRank of a particular page


slide-93
SLIDE 93

 Three kinds of web pages from a

spammer’s point of view

  • Inaccessible pages
  • Accessible pages
  • e.g., blog comments pages
  • spammer can post links to his pages
  • Owned pages
  • Completely controlled by spammer
  • May span multiple domain names

slide-94
SLIDE 94

 Spammer’s goal:

  • Maximize the PageRank of target page t

 Technique:

  • Get as many links from accessible pages as

possible to target page t

  • Construct “link farm” to get PageRank

multiplier effect

slide-95
SLIDE 95

[Diagram: link-farm structure. Target page t receives links from accessible pages and from the spammer’s M owned farm pages 1…M (“millions of farm pages”); inaccessible pages sit outside the spammer’s reach.]

One of the most common and effective organizations for a link farm

slide-96
SLIDE 96

 x: PageRank contributed by accessible pages
 y: PageRank of target page t
 Rank of each “farm” page: β·y/M + (1 − β)/N

  • y = x + β·M·[β·y/M + (1 − β)/N] + (1 − β)/N
  •   = x + β²·y + β·(1 − β)·M/N + (1 − β)/N
  • The last term (1 − β)/N is very small; ignore it

Now we solve for y:

  • y = x/(1 − β²) + c·M/N, where c = β/(1 + β)


N … # pages on the web
M … # of pages the spammer owns

slide-97
SLIDE 97

 =

  • +
  • where =
  •  For  = 0.85, 1/(1-2)= 3.6

 Multiplier effect for acquired PageRank  By making M large, we can make y as

large as we want
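The closed form can be sanity-checked by iterating the two farm equations until they converge and comparing against y = x/(1 − β²) + c·M/N. The values of N, M, and x below are arbitrary assumptions for illustration:

```python
beta = 0.85
N = 1_000_000   # total pages on the web (hypothetical)
M = 10_000      # farm pages owned by the spammer (hypothetical)
x = 0.001       # PageRank flowing in from accessible pages (assumed)

# Closed form from the slide: y = x/(1 - beta^2) + c*M/N with c = beta/(1 + beta)
c = beta / (1 + beta)
y_closed = x / (1 - beta**2) + c * M / N

# Fixed-point iteration of the same pair of equations
# (the tiny (1-beta)/N term on y is ignored, as on the slide):
#   z = beta*y/M + (1-beta)/N   -- rank of each farm page
#   y = x + beta*M*z            -- rank of the target page t
y = 0.0
for _ in range(200):
    z = beta * y / M + (1 - beta) / N
    y = x + beta * M * z

print(y, y_closed)  # the two agree
```

The iteration contracts with factor β², so it converges to exactly the closed-form value, and 1/(1 − β²) ≈ 3.6 is the multiplier the slide quotes for β = 0.85.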



slide-98
SLIDE 98
slide-99
SLIDE 99

 Combating term spam

  • Analyze text using statistical methods
  • Similar to email spam filtering
  • Also useful: Detecting approximate duplicate pages
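The “statistical methods, similar to email spam filtering” idea can be sketched as a toy Naive Bayes log-odds score over page text; the corpus, function names, and smoothing choice here are illustrative assumptions, not part of the slides:

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Count word frequencies in labeled spam/ham pages (toy corpus below is made up)."""
    spam, ham = Counter(), Counter()
    for d in spam_docs:
        spam.update(d.lower().split())
    for d in ham_docs:
        ham.update(d.lower().split())
    return spam, ham, set(spam) | set(ham)

def spam_score(page, spam, ham, vocab):
    """Naive-Bayes log-odds that the page is term spam; positive means spam-like."""
    s_total, h_total, v = sum(spam.values()), sum(ham.values()), len(vocab)
    score = 0.0
    for w in page.lower().split():
        p_s = (spam[w] + 1) / (s_total + v)   # Laplace smoothing
        p_h = (ham[w] + 1) / (h_total + v)
        score += math.log(p_s / p_h)
    return score

spam, ham, vocab = train(["movie movie movie free free"],
                         ["reviews of recent movie releases"])
s_spammy = spam_score("movie movie free free free", spam, ham, vocab)
s_normal = spam_score("reviews of recent releases", spam, ham, vocab)
```

A page stuffed with repeated bait terms scores positive (spam-like), while ordinary text scores negative; real systems use far richer features, but the principle is the same.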

 Combating link spam

  • Detection and blacklisting of structures that look like

spam farms

  • Leads to another war – hiding and detecting spam farms
  • TrustRank = topic-specific PageRank with a teleport

set of trusted pages

  • Example: .edu domains, similar domains for non-US schools

slide-100
SLIDE 100

 Basic principle: Approximate isolation

  • It is rare for a “good” page to point to a “bad”

(spam) page

 Sample a set of seed pages from the web
 Have an oracle (human) identify the good

pages and the spam pages in the seed set

  • Expensive task, so we must make seed set as

small as possible


slide-101
SLIDE 101

 Call the subset of seed pages that are

identified as good the trusted pages

 Perform a topic-sensitive PageRank with

teleport set = trusted pages

  • Propagate trust through links:
  • Each page gets a trust value between 0 and 1

 Solution 1: Use a threshold value and mark

all pages below the trust threshold as spam


slide-102
SLIDE 102

 Set trust of each trusted page to 1
 Suppose trust of page p is tp

  • Page p has a set of out-links op

 For each q ∈ op, p confers the trust β·tp/|op| to q, for 0 < β < 1

 Trust is additive

  • Trust of p is the sum of the trust conferred on p by all its in-linked pages

 Note similarity to Topic-Specific PageRank

  • Within a scaling factor, TrustRank = PageRank with

trusted pages as teleport set
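The splitting-and-attenuation rule above can be sketched directly (β, the toy graph, and the unnormalized additive update are assumptions; actual TrustRank is computed as a biased PageRank, as the slide notes):

```python
def propagate_trust(out_links, trusted, beta=0.8, iters=50):
    """TrustRank-style trust propagation sketch.

    out_links -- dict: page -> list of pages it links to
    trusted   -- seed pages whose trust is pinned at 1
    beta      -- attenuation factor, 0 < beta < 1 (assumed value)
    """
    trust = {p: (1.0 if p in trusted else 0.0) for p in out_links}
    for _ in range(iters):
        new = {p: (1.0 if p in trusted else 0.0) for p in out_links}
        for p, links in out_links.items():
            if not links:
                continue
            share = beta * trust[p] / len(links)   # trust splitting
            for q in links:
                if q in new:
                    new[q] += share                # trust is additive
        trust = new
    return trust

# Hypothetical 4-page graph: a trusted seed links to a and b; both link to c
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
t = propagate_trust(graph, {"seed"})
```

Here the seed’s trust is split across its two out-links (a and b each get β/2 = 0.4), and c accumulates attenuated trust additively from both paths (0.8·0.4 + 0.8·0.4 = 0.64).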

slide-103
SLIDE 103

 Trust attenuation:

  • The degree of trust conferred by a trusted page

decreases with the distance in the graph

 Trust splitting:

  • The larger the number of out-links from a page,

the less scrutiny the page author gives each out- link

  • Trust is split across out-links

slide-104
SLIDE 104

 Two conflicting considerations:

  • Human has to inspect each seed page, so

seed set must be as small as possible

  • Must ensure every good page gets adequate

trust rank, so we need to make all good pages reachable from the seed set by short paths
