CS345a: Data Mining, Jure Leskovec, Stanford University



slide-1
SLIDE 1

CS345a: Data Mining Jure Leskovec

Stanford University

slide-2
SLIDE 2

 Instructors:

  • Jure Leskovec
  • Anand Rajaraman

 TAs:

  • Abhishek Gupta
  • Roshan Sumbaly

 Reach us at cs345a-win0910-staff@lists.stanford.edu

 More info on www.stanford.edu/class/cs345a

2 1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

slide-3
SLIDE 3

 Homework: 20%

  • Gradiance and other
  • 3 late days for the quarter
  • All homeworks must be handed in

 Project: 40%

  • Start early
  • Takes lots of time

 Final Exam: 40%

slide-4
SLIDE 4

 Basic databases: CS145

 Algorithms:

  • Dynamic programming, basic data structures

 Basic statistics:

  • Moments, typical distributions, regression, …

 Programming:

  • Your choice, but C++/Java will be very useful

 We provide some background, but the class will be fast paced

slide-5
SLIDE 5

 Software implementation related to course subject matter

 Should involve an original component or experiment

 More later about available data and computing resources

 It’s going to be fun and hard work

slide-6
SLIDE 6

 Many past projects have dealt with collaborative filtering (advice based on what similar people do)

  • E.g., Netflix Challenge

 Others have dealt with engineering solutions to machine‐learning problems

 Lots of interesting project ideas

  • If you can’t think of one, please come talk to us

slide-7
SLIDE 7

 Data:

  • Netflix
  • WebBase
  • Wikipedia
  • TREC
  • ShareThis
  • Google

 Infrastructure:

  • Aster Data cluster on Amazon EC2
  • Supports both MapReduce and SQL

slide-8
SLIDE 8

 ML generally requires a large “training set” of correctly classified data:

  • Example: classify Web pages by topic

 Hard to find well‐classified data:

  • Open Directory works for page topics, because work is collaborative and shared by many.
  • Other good exceptions?

slide-9
SLIDE 9

Many problems require thought:

  • 1. Tell important pages from unimportant (PageRank)
  • 2. Tell real news from publicity (how?)
  • 3. Distinguish positive from negative product reviews (how?)
  • 4. Feature generation in ML
  • 5. Etc., etc.

slide-10
SLIDE 10

Working in pairs OK, but …

  • 1. No more than two per project.
  • 2. We will expect more from a pair than from an individual.
  • 3. The effort should be roughly evenly distributed.

slide-11
SLIDE 11

 Map‐Reduce and Hadoop
 Recommendation systems

  • Collaborative filtering

 Dimensionality reduction
 Finding nearest neighbors
 Finding similar sets

  • Minhashing, Locality‐Sensitive hashing

 Clustering
 PageRank and measures of importance in graphs (link analysis)

  • Spam detection
  • Topic‐specific search

slide-12
SLIDE 12

 Large scale machine learning
 Association rules, frequent itemsets
 Extracting structured data (relations) from the Web
 Clustering data
 Graph partitioning
 Spam detection
 Managing Web advertisements
 Mining data streams

slide-13
SLIDE 13

 Lots of data is being collected and warehoused

  • Web data, e‐commerce
  • purchases at department/grocery stores
  • Bank/Credit Card transactions

 Computers are cheap and powerful
 Competitive Pressure is Strong

  • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

slide-14
SLIDE 14

 Data collected and stored at enormous speeds (GB/hour)

  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of data

 Traditional techniques infeasible for raw data
 Data mining helps scientists

  • in classifying and segmenting data
  • in Hypothesis Formation

slide-15
SLIDE 15

 There is often information “hidden” in the data that is not readily evident
 Human analysts take weeks to discover useful information
 Much of the data is never analyzed at all

[Figure: The Data Gap. Series: total new disk (TB) since 1995 and number of analysts; horizontal axis 1995 to 1999.]

slide-16
SLIDE 16

 Many Definitions
 Non‐trivial extraction of implicit, previously unknown and useful information from data
 Exploration & analysis, by automatic or semi‐automatic means, of large quantities of data in order to discover meaningful patterns

slide-17
SLIDE 17

 Process of semi‐automatically analyzing large databases to find patterns that are:

  • valid: hold on new data with some certainty
  • novel: non‐obvious to the system
  • useful: should be possible to act on the item
  • understandable: humans should be able to interpret the pattern

slide-18
SLIDE 18

 A big data‐mining risk is that you will “discover” patterns that are meaningless.
 Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

slide-19
SLIDE 19

 A parapsychologist in the 1950s hypothesized that some people had Extra‐Sensory Perception
 He devised an experiment where subjects were asked to guess 10 hidden cards – red or blue
 He discovered that almost 1 in 1000 had ESP – they were able to get all 10 right

slide-20
SLIDE 20

 He told these people they had ESP and called them in for another test of the same type
 Alas, he discovered that almost all of them had lost their ESP
 What did he conclude?
 He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
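The arithmetic behind the ESP story can be checked directly. The sketch below assumes each red/blue guess is an independent fair coin flip and uses a hypothetical pool of 1,000 subjects; the pool size and the function names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def p_all_correct(n_cards: int) -> Fraction:
    """Probability of guessing every one of n_cards red/blue cards by chance."""
    return Fraction(1, 2) ** n_cards

def expected_lucky(n_subjects: int, n_cards: int) -> Fraction:
    """Expected number of subjects who get every card right by pure luck."""
    return n_subjects * p_all_correct(n_cards)

p = p_all_correct(10)               # 1/1024, i.e. "almost 1 in 1000"
lucky = expected_lucky(1000, 10)    # hypothetical pool of 1,000 subjects

print(f"P(all 10 right by chance) = {p}")
print(f"Expected lucky subjects among 1000 = {float(lucky):.2f}")
print(f"P(a lucky subject repeats it on the retest) = {float(p):.5f}")
```

Under these assumptions roughly one subject per thousand scores 10 out of 10 with no ESP at all, and that subject has the same 1/1024 chance on the retest, which is why "almost all of them had lost their ESP".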

slide-21
SLIDE 21

 Banking: loan/credit card approval:

  • predict good customers based on old customers

 Customer relationship management:

  • identify those who are likely to leave for a competitor

 Targeted marketing:

  • identify likely responders to promotions

 Fraud detection: telecommunications, finance

  • from an online stream of events, identify fraudulent events

 Manufacturing and production:

  • automatically adjust knobs when process parameter changes

slide-22
SLIDE 22

 Medicine: disease outcome, effectiveness of treatments

  • analyze patient disease history: find relationship between diseases

 Molecular/Pharmaceutical:

  • identify new drugs

 Scientific data analysis:

  • identify new galaxies by searching for sub clusters

 Web site/store design and promotion:

  • find affinity of visitor to pages and modify layout

slide-23
SLIDE 23

 Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on

  • scalability of number of features and instances
  • stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning
  • automation for handling large, heterogeneous data

[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems.]

slide-24
SLIDE 24

 Prediction Methods

  • Use some variables to predict unknown or future values of other variables.

 Description Methods

  • Find human‐interpretable patterns that describe the data.

slide-25
SLIDE 25

 Classification
 Clustering
 Association Rule Discovery
 Sequential Pattern Discovery
 Regression
 Anomaly Detection

slide-26
SLIDE 26

Courtesy: http://aps.umn.edu

Class:

  • Stages of Formation (Early, Intermediate, Late)

Attributes:

  • Image features,
  • Characteristics of light waves received, etc.

Data Size:

  • 72 million stars, 20 million galaxies
  • Object Catalog: 9 GB
  • Image Database: 150 GB

slide-27
SLIDE 27

 Observe Stock Movements
 Cluster them: Stock‐{UP/DOWN}
 Similarity Measure:

  • Two points are more similar if the events described by them frequently happen together on the same day.

Discovered Clusters (Industry Group: members)

  • 1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  • 2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  • 3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  • 4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
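The slide states the similarity idea only in words: two Stock‐{UP/DOWN} events are similar if they often occur on the same day. One way to make that concrete, shown here as a sketch, is the Jaccard similarity of the sets of days on which each event occurred. Jaccard is an assumed choice, not necessarily the measure used for the clusters above, and the daily data are hypothetical.

```python
# Hypothetical daily observations: day -> set of Stock-{UP/DOWN} events seen that day.
days = {
    "d1": {"INTEL-DOWN", "CISCO-DOWN", "Unocal-UP"},
    "d2": {"INTEL-DOWN", "CISCO-DOWN"},
    "d3": {"Unocal-UP", "Schlumberger-UP"},
    "d4": {"INTEL-DOWN", "Schlumberger-UP", "Unocal-UP"},
}

def days_with(event: str) -> set:
    """Set of days on which `event` was observed."""
    return {day for day, events in days.items() if event in events}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the day sets of two events (assumed measure)."""
    days_a, days_b = days_with(a), days_with(b)
    union = days_a | days_b
    return len(days_a & days_b) / len(union) if union else 0.0

print(f"INTEL-DOWN vs CISCO-DOWN: {similarity('INTEL-DOWN', 'CISCO-DOWN'):.2f}")
print(f"INTEL-DOWN vs Unocal-UP:  {similarity('INTEL-DOWN', 'Unocal-UP'):.2f}")
```

Any clustering algorithm that accepts a pairwise similarity could then group events that score highly with each other.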

slide-28
SLIDE 28

 Given database of user preferences, predict preference of new user
 Example:

  • Predict what new movies you will like based on
  • your past preferences
  • others with similar past preferences
  • their preferences for the new movies

 Example:

  • Predict what books/CDs a person may want to buy
  • (and suggest it, or give discounts to tempt customer)
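The movie example above can be sketched as a small user‐based collaborative filter: find users with similar past ratings and average their ratings for the new movie, weighted by similarity. This is a minimal illustration, assuming cosine similarity over co‐rated movies; the ratings, user names, and function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: rating on a 1-5 scale}); not from the slides.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5, "C": 2, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity over the movies both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user: str, movie: str) -> float:
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += sim
    if den == 0:
        raise ValueError("no similar user has rated this movie")
    return num / den

score = predict("alice", "D")
print(f"predicted rating of D for alice: {score:.2f}")
```

Here alice's tastes resemble bob's more than carol's, so the prediction for movie D leans toward bob's high rating.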

slide-29
SLIDE 29

 Detect significant deviations from normal behavior
 Applications:

  • Credit Card Fraud Detection
  • Network Intrusion Detection

slide-30
SLIDE 30

 Supermarket shelf management.

  • Goal: To identify items that are bought together by sufficiently many customers.
  • Approach: Process the point‐of‐sale data collected with barcode scanners to find dependencies among items.
  • A classic rule ‐‐
  • If a customer buys diaper and milk, then he is likely to buy beer.
  • So, don’t be surprised if you find six‐packs stacked next to diapers!

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
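The two discovered rules can be checked against the five transactions on this slide. The sketch below uses the standard support and confidence measures for association rules; the slide does not define these measures or state any thresholds, so treat them as an assumed way of scoring the rules.

```python
# The five transactions from the slide's table (TID -> items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset: set) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs: set, rhs: set) -> float:
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

On this data {Milk} --> {Coke} holds in 3 of the 4 milk baskets, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 baskets containing both diapers and milk.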

slide-31
SLIDE 31

 Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data

  • Won over (manual) knowledge engineering approach
  • http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process

 Major US bank: Customer attrition prediction

  • Segment customers based on financial behavior: 3 segments
  • Build attrition models for each of the 3 segments
  • 40‐50% of attritions were predicted == factor of 18 increase

 Targeted credit marketing: major US banks

  • find customer segments based on 13 months credit balances
  • build another response model based on surveys
  • increased response 4 times – 2%

slide-32
SLIDE 32

 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data

slide-33
SLIDE 33

[Leskovec et al., TWEB ’07]

 Senders and followers of recommendations receive discounts on products
 Recommendations are made to any number of people at the time of purchase
 Only the recipient who buys first gets a discount

[Figure labels: 10% credit; 10% off]

slide-34
SLIDE 34

Product recommendation network

[Figure legend: purchase following a recommendation; customer recommending a product; customer not buying a recommended product]

slide-35
SLIDE 35

 Large online retailer (June 2001 to May 2003)
 15,646,121 recommendations
 3,943,084 distinct customers
 548,523 products recommended
 99% of them belonging to 4 main product groups:

  • books
  • DVDs
  • music
  • VHS

slide-36
SLIDE 36

 Recommendations

  • sender (shadowed)
  • recipient (shadowed)
  • recommendation time
  • buy bit
  • purchase time
  • product price

 Additional product info (from the retailer’s website)

  • categories
  • reviews
  • ratings

slide-37
SLIDE 37

 What role does the product category play?

        products   customers   recommendations   edges       buy + get discount   buy + no discount
Book    103,161    2,863,977   5,741,611         2,097,809   65,344               17,769
DVD     19,829     805,285     8,180,393         962,341     17,232               58,189
Music   393,598    794,148     1,443,847         585,738     7,837                2,739
Video   26,131     239,583     280,270           160,683     909                  467
Full    542,719    3,943,084   15,646,121        3,153,676   91,322               79,164

slide-38
SLIDE 38

There are relatively few DVD titles, but DVDs account for ~50% of recommendations.

recommendations per person

  • DVD: 10
  • books and music: 2
  • VHS: 1

recommendations per purchase

  • books: 69
  • DVDs: 108
  • music: 136
  • VHS: 203

Overall there are 3.69 recommendations per node on 3.85 different products.

Music recommendations reached about the same number of people as DVDs but used only 1/5 as many recommendations.

Book recommendations reached by far the most people – 2.8 million.

All networks have a very small number of unique edges. For books, videos and music the number of unique edges is smaller than the number of nodes – the networks are highly disconnected.
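The per‐person and per‐purchase figures on this slide follow from the per‐category table on the preceding slide. The sketch below recomputes them, assuming that a "purchase" is the sum of the two buy columns and that the table's Video row is the VHS category; the variable and function names are illustrative.

```python
# Rows copied from the per-category table on the preceding slide:
# (customers, recommendations, buy + get discount, buy + no discount)
table = {
    "Book":  (2_863_977, 5_741_611, 65_344, 17_769),
    "DVD":   (805_285, 8_180_393, 17_232, 58_189),
    "Music": (794_148, 1_443_847, 7_837, 2_739),
    "Video": (239_583, 280_270, 909, 467),  # assumed to be the VHS category
}

total_recs = sum(row[1] for row in table.values())

def per_person(category: str) -> float:
    """Recommendations divided by customers for one category."""
    customers, recs, _, _ = table[category]
    return recs / customers

def per_purchase(category: str) -> float:
    """Recommendations divided by all purchases (with or without discount)."""
    _, recs, with_discount, without_discount = table[category]
    return recs / (with_discount + without_discount)

for cat in table:
    print(f"{cat:5s} per person: {per_person(cat):5.2f}   "
          f"per purchase: {per_purchase(cat):6.1f}   "
          f"share of all recommendations: {table[cat][1] / total_recs:.1%}")
```

The DVD share of all recommendations comes out just above one half, consistent with the "~50%" claim.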

slide-39
SLIDE 39

[Figure: size of giant component versus number of nodes, plotted by month with a quadratic fit; inset: number of nodes n versus month m.]

slide-40
SLIDE 40

 94% of users make first recommendation without having received one previously
 linear growth: ~165,000 new users added each month
 size of giant connected component increases from 1% to 2.5% of the network (100,420 users) – small!
 some sub‐communities are better connected

  • 24% out of 18,000 users for westerns on DVD
  • 26% of 25,000 for classics on DVD
  • 19% of 47,000 for anime (Japanese animated film) on DVD

 others are just as disconnected

  • 3% of 180,000 home and gardening
  • 2‐7% for children’s and fitness DVDs
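The percentages on this slide are the share of nodes that fall in the largest connected component. A minimal sketch of that measurement, using union‐find on a hypothetical toy network, is shown below; the edge list and function name are illustrative, and recommendations are treated as undirected links, which is an assumption.

```python
from collections import Counter

def giant_component_fraction(n_nodes: int, edges: list) -> float:
    """Fraction of nodes in the largest connected component (union-find)."""
    parent = list(range(n_nodes))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(x) for x in range(n_nodes))
    return max(sizes.values()) / n_nodes

# Hypothetical toy network: 10 customers, recommendations as undirected links.
toy_edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 5)]
frac = giant_component_fraction(10, toy_edges)
print(f"giant component covers {frac:.0%} of nodes")
```

For the retailer data, 100,420 users out of the 3,943,084 distinct customers reported earlier is about 2.5%, matching the figure quoted above.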

slide-41
SLIDE 41

 Does sending more recommendations influence more purchases?

[Figure: Number of Purchases versus Outgoing Recommendations.]

slide-42
SLIDE 42

 consider whether sender has at least one successful recommendation
 controls for sender getting credit for purchase that resulted from others recommending the same product to the same person

[Figure: Probability of Credit versus Outgoing Recommendations. Annotation: probability of receiving a credit levels off for DVDs.]

slide-43
SLIDE 43

DVD recommendations (8.2 million observations)

[Figure: Probability of purchasing versus # recommendations received.]

slide-44
SLIDE 44

 Effectiveness of subsequent recommendations?

  • Multiple recommendations between two individuals weaken the impact of the bond on purchases

[Figure: Probability of buying versus Exchanged recommendations.]

slide-45
SLIDE 45

Consider successful recommendations in terms of

  • av. # senders of recommendations per book category
  • av. # of recommendations accepted

books overall have a 3% success rate

  • (2% with discount, 1% without)

Lower than average success rate

  • fiction
  • romance (1.78), horror (1.81)
  • teen (1.94), children’s books (2.06)
  • comics (2.30), sci‐fi (2.34), mystery and thrillers (2.40)
  • nonfiction
  • sports (2.26)
  • home & garden (2.26)
  • travel (2.39)

Higher than average success rate

  • professional & technical
  • medicine (5.68)
  • professional & technical (4.54)
  • engineering (4.10), science (3.90), computers & internet (3.61)
  • law (3.66), business & investing (3.62)

slide-46
SLIDE 46

 Professional & technical book recommendations are more often accepted
 Some organized contexts other than professional also have higher success rate, e.g. religion

  • overall success rate 3.13%
  • Christian themed books
  • Christian living and theology (4.7%)
  • Bibles (4.8%)
  • not‐as‐organized religion
  • new age (2.5%)
  • occult spirituality (2.2%)

 Well organized hobbies

  • books on orchids recommended successfully twice as often as books on tomato growing

slide-47
SLIDE 47

Variable            transformation   Coefficient
const                                -0.940 ***
# recommendations   ln(r)             0.426 ***
# senders           ln(ns)           -0.782 ***
# recipients        ln(nr)           -1.307 ***
product price       ln(p)             0.128 ***
# reviews           ln(v)            -0.011 ***
avg. rating         ln(t)            -0.027 *
R2                                    0.74

significance at the 0.01 (***), 0.05 (**) and 0.1 (*) levels
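Because every predictor enters the regression through a logarithm, a coefficient can be read as an elasticity if the response is also modeled on a log scale. The slide does not name the response variable or state its scale, so the log‐log form below is an assumption, and the function name is illustrative. The sketch shows what doubling one predictor would do to the response under that assumption, using the two coefficients for # recommendations and product price.

```python
def multiplier(coefficient: float, factor: float) -> float:
    """Multiplicative change in the response when one predictor is scaled by
    `factor`, under an assumed log-log model ln(y) = c + sum(b_i * ln(x_i))."""
    return factor ** coefficient

# Coefficients read from the table above.
b_recommendations = 0.426
b_price = 0.128

print(f"doubling # recommendations: x{multiplier(b_recommendations, 2):.3f}")
print(f"doubling product price:     x{multiplier(b_price, 2):.3f}")
```

Under this reading, doubling the number of recommendations raises the response by about a third, while doubling the price raises it by under a tenth.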

slide-48
SLIDE 48

 47,000 customers responsible for 2.5 out of 16 million recommendations in the system
 29% success rate per recommender of an anime DVD
 Giant component covers 19% of the nodes
 Overall, recommendations for DVDs are more likely to result in a purchase (7%), but the anime community stands out

slide-49
SLIDE 49

Three colors: blue, white & red

showing purchasers only

slide-50
SLIDE 50

 Small community

  • few reviews, senders, and recipients
  • but sending more recommendations helps

 Pricey products
 Rating doesn’t play as much of a role

slide-51
SLIDE 51

Observations for diffusion models:

 purchase decision more complex than threshold or simple infection
 influence saturates as the number of contacts expands
 links lose effectiveness if they are overused

Conditions for successful recommendations:

 professional and organizational contexts
 discounts on expensive items
 small, tightly knit communities