Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis



slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 17: Link Analysis

Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

slide-2
SLIDE 2

Graph Data: Media Networks

Connections between political blogs

Polarization of the network [Adamic-Glance, 2005]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-3
SLIDE 3

Schedule Updates

slide-4
SLIDE 4

Web search before PageRank

  • Human-curated (e.g. Yahoo, Looksmart)
      • Hand-written descriptions
      • Wait time for inclusion
  • Text-search (e.g. WebCrawler, Lycos)
      • Prone to term-spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-5
SLIDE 5

Web as a Directed Graph

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-6
SLIDE 6

PageRank: Links as Votes

  • Pages with more inbound links are more important
  • Inbound links from important pages carry more weight

Not all pages are equally important

[Figure labels: many inbound links vs. few/no inbound links; links from unimportant pages vs. links from important pages]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-7
SLIDE 7

Example: PageRank Scores

[Figure: example graph with PageRank scores in percent. B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five unlabeled nodes at 1.6 each]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-8
SLIDE 8

Simple Recursive Formulation

  • A link’s vote is proportional to the importance of its source page
  • If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes
  • Page j’s own importance is the sum of the votes on its in-links

[Figure: page j receives $r_i/3$ from page i and $r_k/4$ from page k, and passes $r_j/3$ along each of its three out-links]

$r_j = r_i/3 + r_k/4$

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
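
The vote-counting rule can be sketched in a few lines of Python. The importance values 0.3 and 0.4 are made-up numbers for illustration only; the out-degrees 3 and 4 follow the figure.

```python
def importance_from_in_links(rank, out_degree, in_links):
    """Sum the votes r_i / d_i over all pages i that link to the target page."""
    return sum(rank[src] / out_degree[src] for src in in_links)


# Hypothetical importances for pages i and k (assumed, not from the slides).
rank = {"i": 0.3, "k": 0.4}
# Out-degrees as in the figure: i has 3 out-links, k has 4.
out_degree = {"i": 3, "k": 4}

# r_j = r_i/3 + r_k/4
r_j = importance_from_in_links(rank, out_degree, in_links=["i", "k"])
print(f"r_j = {r_j:.3f}")
```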

slide-9
SLIDE 9

Equivalent Formulation: Random Surfer

  • At time t a surfer is on some page i
  • At time t+1 the surfer follows a link to a new page at random
  • Define rank $r_i$ as fraction of time spent on page i

[Figure: page j receives $r_i/3$ from page i and $r_k/4$ from page k, and passes $r_j/3$ along each of its three out-links]

$r_j = r_i/3 + r_k/4$

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-10
SLIDE 10

PageRank: The “Flow” Model

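$r_j = \sum_{i \to j} \frac{r_i}{d_i}$ (where $d_i$ is the out-degree of page i)

[Figure: graph with nodes y, a, m; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2), m→a (m)]

“Flow” equations:

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

  • 3 equations, 3 unknowns
  • Impose constraint: $r_y + r_a + r_m = 1$
  • Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$

(adapted from: Mining of Massive Datasets, http://www.mmds.org)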
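
As a worked check of the quoted solution: the three flow equations are linearly dependent (any rescaling of a solution is again a solution), which is why the normalization constraint is needed. Substituting step by step:

```latex
\begin{align*}
r_m &= \tfrac{1}{2} r_a \\
r_a &= \tfrac{1}{2} r_y + r_m = \tfrac{1}{2} r_y + \tfrac{1}{2} r_a
  \;\Rightarrow\; r_a = r_y \\
r_y + r_a + r_m &= r_y + r_y + \tfrac{1}{2} r_y = \tfrac{5}{2} r_y = 1
  \;\Rightarrow\; r_y = \tfrac{2}{5},\quad r_a = \tfrac{2}{5},\quad r_m = \tfrac{1}{5}
\end{align*}
```

The remaining equation, $r_y = r_y/2 + r_a/2$, holds automatically once $r_a = r_y$.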

slide-11
SLIDE 11

PageRank: The “Flow” Model

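$r_j = \sum_{i \to j} \frac{r_i}{d_i}$ (where $d_i$ is the out-degree of page i)

[Figure: graph with nodes y, a, m; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2), m→a (m)]

“Flow” equations:

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

In matrix form: $r = M \cdot r$

Matrix M is stochastic (i.e. columns sum to one)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)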
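
A minimal sketch of how M could be built from out-link lists, assuming the edges implied by the flow equations (y→y, y→a, a→y, a→m, m→a). The function name and the node ordering (y, a, m) are illustrative choices, not part of the lecture.

```python
from fractions import Fraction


def build_transition_matrix(out_links, nodes):
    """Return M with M[dst][src] = 1 / out_degree(src) for every link src -> dst."""
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, dests in out_links.items():
        for dst in dests:
            M[index[dst]][index[src]] += Fraction(1, len(dests))
    return M


# Edges read off the flow equations (assumed ordering: y, a, m).
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
nodes = ["y", "a", "m"]

M = build_transition_matrix(out_links, nodes)
# Each column holds the out-going probabilities of one source page.
column_sums = [sum(M[i][j] for i in range(len(nodes))) for j in range(len(nodes))]

for row in M:
    print([str(x) for x in row])
print("column sums:", [str(s) for s in column_sums])
```

Each row of M reproduces one flow equation: the y row is (1/2, 1/2, 0), matching $r_y = r_y/2 + r_a/2$.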

slide-12
SLIDE 12

PageRank: Eigenvector Problem

  • PageRank: Solve for eigenvector $r = M r$ with eigenvalue λ = 1
  • Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if $a = M b$ then $\sum_i a_i = \sum_i b_i$)
  • Problem: There are billions of pages on the internet. How do we solve for an eigenvector with order $10^{10}$ elements?
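
Both properties can be checked exactly on the three-page example. This is a sketch assuming the y/a/m matrix from the flow equations; the test vector b is arbitrary and chosen only for illustration.

```python
from fractions import Fraction as F

# Column-stochastic matrix for the y/a/m example (rows and columns ordered y, a, m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]


def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


# Arbitrary vector b (assumed for illustration): multiplying by M preserves the sum.
b = [F(1, 6), F(1, 3), F(1, 2)]
a = matvec(M, b)
print("sum(b) =", sum(b), " sum(a) =", sum(a))

# The flow-model solution is an eigenvector with eigenvalue 1: M r = r.
r = [F(2, 5), F(2, 5), F(1, 5)]
print("M r == r:", matvec(M, r) == r)
```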

slide-13
SLIDE 13

PageRank: Power Iteration

Model for random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t follow an outgoing link at random

Probabilistic interpretation:

slide-14
SLIDE 14

PageRank: Power Iteration

[Figure: graph with nodes y, a, m; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2), m→a (m)]

$p_t$ converges to $r$. Iterate until $|p_t - p_{t-1}| < \epsilon$

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
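
A minimal power-iteration sketch for the y/a/m example, assuming the update $p_t = M p_{t-1}$ from a uniform start and the L1 norm for the stopping test (the slide does not say which norm is meant). The default ε and the iteration cap are arbitrary choices.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat p <- M p from a uniform start until the L1 change drops below eps."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        p_next = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(p_next, p)) < eps:
            return p_next
        p = p_next
    return p


# Column-stochastic matrix for the y/a/m example (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print([round(x, 6) for x in r])  # approaches (2/5, 2/5, 1/5)
```

Each step is one matrix-vector product, which is what makes the approach usable when the vector has billions of entries.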

slide-17
SLIDE 17

Aside: Ergodicity

  • PageRank assumes a random walk model for individual surfers
  • Equivalent assumption: flow model in which equal fractions of surfers follow each link at every time
  • Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk

Averaging over individuals is equivalent to averaging a single individual over time
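
The ergodicity claim can be illustrated by simulating one surfer on the y/a/m example and counting the fraction of time spent on each page. This is a sketch: the step count, seed, and start page are arbitrary assumptions, and the result is only approximately equal to the flow solution (2/5, 2/5, 1/5).

```python
import random


def simulate_surfer(out_links, start, steps, seed=0):
    """Follow random out-links for `steps` steps; return fraction of time per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(out_links[page])
    return {page: count / steps for page, count in visits.items()}


# Edges of the y/a/m example: y->y, y->a, a->y, a->m, m->a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

time_fractions = simulate_surfer(out_links, start="y", steps=200_000)
print({page: round(frac, 3) for page, frac in time_fractions.items()})
```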

slide-18
SLIDE 18

PageRank: Problems

[Figure: a dead end and a spider trap]

  • 1. Dead Ends
      • Nodes with no outgoing links.
      • Where do surfers go next?
  • 2. Spider Traps
      • Subgraph with no outgoing links to wider graph
      • Surfers are “trapped” with no way out.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-19
SLIDE 19

Power Iteration: Dead Ends

[Figure: graph with nodes y, a, m in which m is a dead end; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2)]

Probability not conserved

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-20
SLIDE 20

Power Iteration: Dead Ends

[Figure: graph with nodes y, a, m in which m is a dead end; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2)]

Fixes “probability sink” issue (teleport at dead ends)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
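
A sketch of both effects on the dead-end graph: without a fix the total probability drains away, and replacing the all-zero column of the dead end with a uniform column (a teleport to a random page) restores conservation. The helper names and the 100-step count are illustrative assumptions.

```python
def step(M, p):
    """One power-iteration step: p_next = M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]


def iterate(M, steps):
    """Run `steps` power-iteration steps from the uniform distribution."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = step(M, p)
    return p


def teleport_at_dead_ends(M):
    """Replace every all-zero column (a dead end) with a uniform column 1/N."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):
            for i in range(n):
                fixed[i][j] = 1.0 / n
    return fixed


# Rows are destinations, columns are sources, ordered (y, a, m); m has no out-links.
M_dead_end = [[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]]

leaky = iterate(M_dead_end, 100)
fixed = iterate(teleport_at_dead_ends(M_dead_end), 100)
print("total probability without fix:", sum(leaky))
print("with teleport at dead ends:", [round(x, 4) for x in fixed], "sum =", round(sum(fixed), 6))
```

With the fix, the iteration settles at (6/13, 4/13, 3/13) for this graph.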

slide-21
SLIDE 21

Power Iteration: Spider Traps

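[Figure: graph with nodes y, a, m in which m links only to itself; edges y→y (y/2), y→a (y/2), a→y (a/2), a→m (a/2), m→m (m)]

Probability accumulates in traps (surfers get stuck)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)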
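
A short sketch of the trap in action, assuming the graph in which m links only to itself. Probability is conserved here, but all of it ends up on m. The step counts are arbitrary.

```python
def run_power_iteration(M, steps):
    """Return the distribution after `steps` updates p <- M p from a uniform start."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


# Rows are destinations, columns are sources, ordered (y, a, m); m -> m is the trap.
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

after_1 = run_power_iteration(M_trap, 1)
after_100 = run_power_iteration(M_trap, 100)
print("after 1 step:   ", [round(x, 4) for x in after_1])
print("after 100 steps:", [round(x, 4) for x in after_100])
```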

slide-22
SLIDE 22

Solution: Random Teleports

Model for teleporting random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t
      • With probability β follow an outgoing link at random
      • With probability 1-β teleport to a new initial location at random

PageRank Equation [Page & Brin 1998]

$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
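
The PageRank equation can be iterated directly over out-link lists. The sketch below assumes β = 0.8 (a value not given on the slide), an L1 stopping test, and that every page has at least one out-link; it is applied to the spider-trap graph, where m links only to itself.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N until it settles.

    Assumes every page has at least one out-link (no dead ends).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Teleport share that every page receives regardless of links.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for src, dests in out_links.items():
            share = beta * rank[src] / len(dests)
            for dst in dests:
                new_rank[dst] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < eps:
            break
    return rank


# Spider-trap graph: y->y, y->a, a->y, a->m, m->m.
spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

rank = pagerank(spider_trap)
print({page: round(value, 4) for page, value in rank.items()})
```

With these assumptions the trap page m still gets the largest score (21/33), but y and a keep nonzero rank (7/33 and 5/33) instead of being drained to zero.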

slide-23
SLIDE 23

Power Iteration: Teleports

[Figure: graph with nodes y, a, m]

(can use power iteration as normal)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
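
One way to see why power iteration still works "as normal": teleports can be folded into a single matrix $A = \beta M + (1 - \beta)/N$ (the second term added to every entry), which is again column-stochastic. The sketch below assumes β = 4/5 and the spider-trap version of the y/a/m graph; exact fractions are used so the checks are not blurred by rounding.

```python
from fractions import Fraction as F


def teleport_matrix(M, beta):
    """A[i][j] = beta * M[i][j] + (1 - beta) / N (uniform teleport)."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Spider-trap graph, rows = destinations, columns = sources, ordered (y, a, m).
M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]

beta = F(4, 5)  # assumed value for illustration
A = teleport_matrix(M_trap, beta)
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

p0 = [F(1, 3)] * 3      # uniform start
p1 = matvec(A, p0)      # one ordinary power-iteration step

r = [F(7, 33), F(5, 33), F(21, 33)]  # candidate fixed point
print("column sums:", [str(s) for s in column_sums])
print("p1 =", [str(x) for x in p1])
print("A r == r:", matvec(A, r) == r)
```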

slide-26
SLIDE 26

Computing PageRank

  • M is sparse - only store nonzero entries
  • Space proportional roughly to number of links
  • Say 10N entries (about 10 links per page) at 4 bytes each, or 4*10*1 billion = 40GB
  • Still won’t fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
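
A sketch of the (source, degree, destinations) encoding and of one update step that reads only those records. The graph is the three-page spider-trap example with pages numbered 0 (y), 1 (a), 2 (m); β = 0.8 and the function name are assumptions for illustration.

```python
# Storage estimate from the slide: 4 bytes per entry, ~10 links per page, 1 billion pages.
bytes_per_entry = 4
links_per_page = 10
pages = 10**9
total_bytes = bytes_per_entry * links_per_page * pages
print("estimated size:", total_bytes / 10**9, "GB")


def sparse_step(records, r_old, beta=0.8):
    """One PageRank update using only (source, degree, destinations) records."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in records:
        for dst in dests:
            r_new[dst] += beta * r_old[src] / degree
    return r_new


# Spider-trap example: 0 = y, 1 = a, 2 = m (m links only to itself).
records = [
    (0, 2, [0, 1]),
    (1, 2, [0, 2]),
    (2, 1, [2]),
]

r_new = sparse_step(records, r_old=[1 / 3, 1 / 3, 1 / 3])
print([round(x, 4) for x in r_new])
```

Only the nonzero entries of M are ever touched, so the work per step is proportional to the number of links.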

slide-27
SLIDE 27

Block-based Update Algorithm

  • Break $r^{new}$ into k blocks that fit in memory
  • Scan M and $r^{old}$ once for each block

M:

src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4

[Figure: $r^{new}$ and $r^{old}$ each have entries 0 to 5]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
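
A minimal in-memory sketch of the block-based update, using the three source rows shown in the table. In a real setting the records and $r^{old}$ would be streamed from disk; here they are plain lists. β = 0.8 and the block size are assumptions, and because only three of the six nodes have rows listed, the toy result does not sum to one.

```python
def block_update(records, r_old, block_size, beta=0.8):
    """Compute r_new one block at a time, scanning all of M once per block."""
    n = len(r_old)
    r_new = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Only this block of r_new needs to be held in memory.
        block = [(1.0 - beta) / n] * (stop - start)
        for src, degree, dests in records:  # full scan of M for this block
            for dst in dests:
                if start <= dst < stop:
                    block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new


# (source, degree, destinations) rows from the slide; nodes are numbered 0..5.
records = [
    (0, 4, [0, 1, 3, 5]),
    (1, 2, [0, 5]),
    (2, 2, [3, 4]),
]
r_old = [1 / 6] * 6

blocked = block_update(records, r_old, block_size=2)   # k = 3 blocks
unblocked = block_update(records, r_old, block_size=6)  # a single block
print([round(x, 4) for x in blocked])
```

The cost of this scheme is k full scans of M per iteration, which is the motivation for splitting M itself into stripes.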

slide-28
SLIDE 28

Block-Stripe Update Algorithm

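Break M into stripes: Each stripe contains only destination nodes in the corresponding block of $r^{new}$

Stripe for $r^{new}$ block {0, 1}:

src   degree   destination
0     4        0, 1
1     3        0
2     2        1

Stripe for $r^{new}$ block {2, 3}:

src   degree   destination
0     4        3
2     2        3

Stripe for $r^{new}$ block {4, 5}:

src   degree   destination
0     4        5
1     3        5
2     2        4

[Figure: $r^{new}$ and $r^{old}$ each have entries 0 to 5]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)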
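
A sketch of the block-stripe idea: M is split once into stripes, and each block of $r^{new}$ is then computed from its own stripe only. The helper names, β = 0.8, and the block size are assumptions; the input rows here are (0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4]), a small six-node example chosen so that degrees and destination lists agree.

```python
def make_stripes(records, n, block_size):
    """Split (source, degree, destinations) rows by destination block."""
    stripes = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        stripe = []
        for src, degree, dests in records:
            kept = [d for d in dests if start <= d < stop]
            if kept:
                # The full out-degree is kept so each vote is still r_old[src] / degree.
                stripe.append((src, degree, kept))
        stripes.append((start, stop, stripe))
    return stripes


def stripe_update(stripes, r_old, beta=0.8):
    """Compute r_new block by block, reading only the matching stripe of M."""
    n = len(r_old)
    r_new = []
    for start, stop, stripe in stripes:
        block = [(1.0 - beta) / n] * (stop - start)
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new


records = [
    (0, 4, [0, 1, 3, 5]),
    (1, 2, [0, 5]),
    (2, 2, [3, 4]),
]
stripes = make_stripes(records, n=6, block_size=2)
r_new = stripe_update(stripes, r_old=[1 / 6] * 6)

for start, stop, stripe in stripes:
    print(f"block [{start}, {stop}):", stripe)
print([round(x, 4) for x in r_new])
```

Each stripe is read once per iteration, so M is scanned roughly once in total rather than once per block, at the price of repeating the source and degree fields in several stripes.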

slide-29
SLIDE 29

First Spammers: Term Spam

  • How do you make your page appear to be about movies?
      • (1) Add the word movie 1,000 times to your page
          • Set text color to the background color, so only search engines would see it
      • (2) Or, run the query “movie” on your target search engine
          • See what page came first in the listings
          • Copy it into your page, make it “invisible”
  • These and similar techniques are term spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-30
SLIDE 30

Google’s Solution to Term Spam

  • Believe what people say about you, rather than what you say about yourself
      • Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
  • PageRank as a tool to measure the “importance” of Web pages

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-31
SLIDE 31

Google vs. Spammers: Round 2!

  • Once Google became the dominant search engine, spammers began to work out ways to fool Google
  • Spam farms were developed to concentrate PageRank on a single page
  • Link spam:
      • Creating link structures that boost PageRank of a particular page

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-32
SLIDE 32

Link Spamming

  • Three kinds of web pages from a spammer’s point of view
      • Inaccessible pages
      • Accessible pages
          • e.g., blog comments pages
          • spammer can post links to his pages
      • Owned pages
          • Completely controlled by spammer
          • May span multiple domain names

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-33
SLIDE 33

Link Farms

  • Spammer’s goal:
      • Maximize the PageRank of target page t
  • Technique:
      • Get as many links from accessible pages as possible to target page t
      • Construct “link farm” to get PageRank multiplier effect

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-34
SLIDE 34

Link Farms

[Figure: inaccessible pages, accessible pages, and owned pages (target page t plus farm pages 1, 2, ..., M)]

One of the most common and effective organizations for a link farm

Millions of farm pages

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

slide-35
SLIDE 35

PageRank: Extensions

  • Topic-specific PageRank:
      • Restrict teleportation to some set S of pages related to a specific topic
      • Set $p_{0i} = 1/|S|$ if $i \in S$, $p_{0i} = 0$ otherwise
  • Trust Propagation
      • Use set S of trusted pages for teleport set
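
A sketch of topic-specific PageRank on the y/a/m example, assuming β = 0.8, a fixed number of iterations, and a made-up topic set S = {m}. The only change from ordinary PageRank is that the teleport term is spread over S instead of over all N pages.

```python
def topic_pagerank(out_links, topic_set, beta=0.8, iters=200):
    """PageRank whose teleport step lands only on pages in topic_set.

    Assumes every page has at least one out-link (no dead ends).
    """
    nodes = list(out_links)
    # Teleport distribution: 1/|S| on pages in S, 0 elsewhere.
    teleport = {v: (1.0 / len(topic_set) if v in topic_set else 0.0) for v in nodes}
    rank = dict(teleport)  # start from the teleport distribution
    for _ in range(iters):
        new_rank = {v: (1.0 - beta) * teleport[v] for v in nodes}
        for src, dests in out_links.items():
            for dst in dests:
                new_rank[dst] += beta * rank[src] / len(dests)
        rank = new_rank
    return rank


# Edges of the y/a/m example: y->y, y->a, a->y, a->m, m->a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

# Hypothetical topic set containing only page m.
rank = topic_pagerank(out_links, topic_set={"m"})
print({page: round(value, 4) for page, value in rank.items()})
```

Trust propagation can reuse the same routine by passing a set of trusted pages as `topic_set`.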