
slide-1
SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

slide-2
SLIDE 2

3/9/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2

slide-3
SLIDE 3

¡ Date:
§ Thursday, March 19, 12:15-3:15 PM PDT
¡ Location:
§ if SUNetID[0] in ['A' .. 'R'] then Cubberley Auditorium
§ if SUNetID[0] in ['S' .. 'Z'] then STLC114
¡ Alternate Date:
§ Wednesday, March 18, 6:00-9:00 PM PDT
¡ Location:
§ Gates 104
§ There is still SOME SPACE LEFT!
¡ TAs will NOT answer questions during the final

slide-4
SLIDE 4

You may come to Stanford to take the exam, or…

¡ Date:
§ From Wed, Mar 18, 6 PM to Thu, Mar 19, 6 PM (PDT)
§ Agree with your exam monitor on the most convenient 3-hour slot in that window of time
¡ Exam monitors will receive an email from SCPD with the final exam, which they will in turn forward to you right before the beginning of your 3-hour slot
¡ Once you have completed the exam, make sure to send the file back to your exam monitor (high-quality scanned copy)
¡ Exam monitors will NOT answer questions during the final

slide-5
SLIDE 5

¡ Final exam is open book and open notes
¡ A calculator or computer is REQUIRED
§ You may only use your computer to do arithmetic calculations (i.e., the buttons found on a standard scientific calculator)
§ You may also use your computer to read course notes or the textbook
§ But no Internet/Google/Python access is allowed
¡ Practice finals are posted on Piazza!
¡ We recommend bringing a power strip

slide-6
SLIDE 6

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

slide-7
SLIDE 7

¡ Redundancy leads to a bad user experience
§ Uncertainty around information need => don't put all eggs in one basket
¡ How do we optimize for diversity directly?

slide-8
SLIDE 8

[Figure: news front page for Monday, January 14, 2013 — headlines: "France intervenes", "Chuck for Defense", "Argo wins big", "Hagel expects fight"]

slide-9
SLIDE 9

[Figure: the same front page with headlines: "France intervenes", "Chuck for Defense", "Argo wins big", "New gun proposals"]

slide-10
SLIDE 10

¡ Idea: Encode diversity as a coverage problem
¡ Example: Word cloud of news for a single day
§ Want to select articles so that most words are "covered"

slide-11
SLIDE 11


slide-12
SLIDE 12

¡ Q: What is being covered?
¡ A: Concepts (in our case: named entities)
¡ Q: Who is doing the covering?
¡ A: Documents

[Figure: named entities — France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL — covered by the document "Hagel expects fight"]

slide-13
SLIDE 13

¡ Suppose we are given a set of documents D
§ Each document d covers a set X_d of words/topics/named entities from W
¡ For a set of documents A ⊆ D we define
F(A) = |⋃_{d∈A} X_d|
¡ Goal: We want to
max_{A: |A| ≤ k} F(A)
¡ Note: F(A) is a set function: F(A): Sets → ℕ

slide-14
SLIDE 14

¡ Given a universe of elements W = {w_1, …, w_n} and sets X_1, …, X_m ⊆ W
¡ Goal: Find k sets X_i that cover the most of W
§ More precisely: Find k sets X_i whose size of union is the largest
§ Bad news: A known NP-complete problem

[Figure: universe W with overlapping sets X1, X2, X3, X4]

slide-15
SLIDE 15

Simple Heuristic: Greedy Algorithm:

¡ Start with A_0 = { }
¡ For i = 1 … k:
§ Find set d that maximizes F(A_{i-1} ∪ {d})
§ Let A_i = A_{i-1} ∪ {d}
¡ Example:
§ Eval. F({d_1}), …, F({d_m}), pick best (say d_1)
§ Eval. F({d_1} ∪ {d_2}), …, F({d_1} ∪ {d_m}), pick best (say d_2)
§ Eval. F({d_1, d_2} ∪ {d_3}), …, F({d_1, d_2} ∪ {d_m}), pick best
§ And so on…

F(A) = |⋃_{d∈A} X_d|
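The greedy loop above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the document names and word sets below are made up):

```python
def greedy_max_cover(sets, k):
    """Greedily pick k sets maximizing F(A) = |union of chosen sets|."""
    covered, chosen = set(), []
    for _ in range(k):
        # d maximizing the marginal gain F(A ∪ {d}) − F(A)
        best = max(sets, key=lambda d: len(sets[d] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen

# Hypothetical documents and the named entities they cover
docs = {
    "d1": {"France", "Mali"},
    "d2": {"Obama", "Romney", "France"},
    "d3": {"Argo"},
}
print(greedy_max_cover(docs, 2))  # picks d2 first (3 new words), then d1
```

Each round re-evaluates every candidate's marginal gain against the words covered so far, exactly as in the Eval steps above.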

slide-16
SLIDE 16

¡ Goal: Maximize the covered area


slide-17
SLIDE 17

¡ Goal: Maximize the covered area


slide-18
SLIDE 18

¡ Goal: Maximize the covered area


slide-19
SLIDE 19

¡ Goal: Maximize the covered area


slide-20
SLIDE 20

¡ Goal: Maximize the covered area


slide-21
SLIDE 21

¡ Goal: Maximize the size of the covered area
¡ Greedy first picks A and then C
¡ But the optimal way would be to pick B and C

[Figure: three overlapping sets A, B, C]

slide-22
SLIDE 22

¡ Greedy produces a solution A where: F(A) ≥ (1 − 1/e)·OPT (F(A) ≥ 0.63·OPT)
[Nemhauser, Fisher, Wolsey '78]
¡ Claim holds for functions F(·) with 2 properties:
§ F is monotone (adding more docs doesn't decrease coverage): if A ⊆ B then F(A) ≤ F(B), and F({}) = 0
§ F is submodular: adding an element to a set gives less improvement than adding it to one of its subsets

slide-23
SLIDE 23

Definition:

¡ Set function F(·) is called submodular if:
For all A, B ⊆ W: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

[Figure: Venn diagrams of A, B, A ∪ B, and A ∩ B illustrating the inequality]
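For small examples, the definition above can be checked by brute force. A quick sketch (the coverage function and document sets are made up for illustration):

```python
from itertools import combinations

def coverage(sets, A):
    """F(A) = number of elements covered by the sets indexed by A."""
    covered = set()
    for d in A:
        covered |= sets[d]
    return len(covered)

def is_submodular(sets):
    """Check F(A) + F(B) >= F(A ∪ B) + F(A ∩ B) over all pairs of subsets."""
    ids = list(sets)
    all_subsets = [frozenset(s) for r in range(len(ids) + 1)
                   for s in combinations(ids, r)]
    return all(
        coverage(sets, A) + coverage(sets, B)
        >= coverage(sets, A | B) + coverage(sets, A & B)
        for A in all_subsets for B in all_subsets
    )

docs = {"d1": {1, 2}, "d2": {2, 3}, "d3": {3, 4, 5}}
print(is_submodular(docs))  # True: coverage is submodular
```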

slide-24
SLIDE 24

¡ Diminishing returns characterization

Equivalent definition:

¡ Set function F(·) is called submodular if:
For all A ⊆ B and d ∉ B:
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

[Figure: gain of adding d to the small set A (large improvement) vs. the large set B (small improvement)]

slide-25
SLIDE 25

¡ Natural example:
§ Sets X_1, …, X_m
§ F(A) = |⋃_{d∈A} X_d| (size of the covered area)
§ Claim: F(A) is submodular!

[Figure: gain of adding d to a small set A vs. a large set B ⊇ A: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)]

slide-26
SLIDE 26

¡ Submodularity is a discrete analogue of concavity

[Figure: concave curve F(·) vs. solution size |A|; for all A ⊆ B, adding d to B helps less than adding it to A: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)]

slide-27
SLIDE 27

¡ Marginal gain:
Δ_F(d | A) = F(A ∪ {d}) − F(A)
¡ Submodular (A ⊆ B):
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
¡ Concavity (a ≤ b):
f(a + d) − f(a) ≥ f(b + d) − f(b)

[Figure: concave function f plotted against |A|]

slide-28
SLIDE 28

¡ Let F_1, …, F_m be submodular and λ_1, …, λ_m > 0; then F(A) = Σ_{i=1}^{m} λ_i F_i(A) is submodular
§ Submodularity is closed under non-negative linear combinations!
¡ This is an extremely useful fact:
§ Average of submodular functions is submodular: F(A) = Σ_i P(i) · F_i(A)
§ Multicriterion optimization: F(A) = Σ_i λ_i F_i(A)
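The closure claim follows in one line from the diminishing-returns definition (a standard argument, not spelled out on the slide). For any A ⊆ B and d ∉ B:

F(A ∪ {d}) − F(A) = Σ_i λ_i (F_i(A ∪ {d}) − F_i(A)) ≥ Σ_i λ_i (F_i(B ∪ {d}) − F_i(B)) = F(B ∪ {d}) − F(B)

since each λ_i ≥ 0 and each F_i is submodular, so every term in the sum is at least its counterpart.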

slide-29
SLIDE 29

¡ Q: What is being covered?
¡ A: Concepts (in our case: named entities)
¡ Q: Who is doing the covering?
¡ A: Documents

[Figure: named entities covered by the document "Hagel expects fight"]

slide-30
SLIDE 30

¡ Objective: pick k docs that cover most concepts
¡ F(A): the number of concepts covered by A
§ Elements … concepts; Sets … concepts in docs
§ F(A) is submodular and monotone!
§ We can use the greedy algorithm to optimize F

[Figure: concepts — France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL — covered by the documents "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]

slide-31
SLIDE 31

¡ Objective: pick k docs that cover most concepts
¡ The good: submodular; penalizes redundancy
¡ The bad: ignores concept importance; all-or-nothing coverage is too harsh

[Figure: concepts covered by "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]

slide-32
SLIDE 32


slide-33
SLIDE 33

¡ Objective: pick k docs that cover most concepts
¡ Each concept c has importance weight w_c

[Figure: weighted concepts — France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL — covered by "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]

slide-34
SLIDE 34

¡ Document coverage function: cover_d(c)
§ probability document d covers concept c [e.g., how strongly d covers c]

[Figure: document "Enthusiasm for Inauguration wanes" partially covering the concepts Obama and Romney]

slide-35
SLIDE 35

¡ Document coverage function:
§ cover_d(c) = probability document d covers concept c
§ cover_d(c) can also model how relevant concept c is for user u
¡ Set coverage function:
§ cover_A(c) = probability that at least one document in A covers c
¡ Objective:
max_{A: |A| ≤ k} F(A) = Σ_c w_c · cover_A(c)
(w_c are the concept weights)
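The probabilistic objective can be sketched as follows. Here cover_A(c) is computed as 1 minus the probability that no document in A covers c, which assumes documents cover concepts independently; the probabilities and weights below are made-up illustration values:

```python
def cover_A(cover, A, c):
    """Probability that at least one document in A covers concept c,
    assuming documents cover concepts independently."""
    p_none = 1.0
    for d in A:
        p_none *= 1.0 - cover[d].get(c, 0.0)
    return 1.0 - p_none

def objective(cover, weights, A):
    """F(A) = sum over concepts c of w_c * cover_A(c)."""
    return sum(w * cover_A(cover, A, c) for c, w in weights.items())

# Hypothetical per-document coverage probabilities and concept weights
cover = {
    "doc1": {"Obama": 0.8, "Romney": 0.5},
    "doc2": {"Obama": 0.6, "Argo": 0.9},
}
weights = {"Obama": 1.0, "Romney": 0.5, "Argo": 0.7}
print(objective(cover, weights, ["doc1", "doc2"]))  # ≈ 1.8
```

Note the diminishing-returns behavior: doc2's marginal contribution is smaller once doc1 is already in the set, since doc1 has partially covered "Obama".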

slide-36
SLIDE 36

¡ The objective function
max_{A: |A| ≤ k} F(A) = Σ_c w_c · cover_A(c)
is also submodular
§ Intuitively, it has a diminishing returns property
§ The greedy algorithm leads to a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution

slide-37
SLIDE 37

¡ Objective: pick k docs that cover most concepts
¡ Each concept c has importance weight w_c
¡ Documents partially cover concepts: cover_d(c)

[Figure: weighted concepts partially covered by "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]

slide-38
SLIDE 38


slide-39
SLIDE 39

¡ Greedy algorithm is slow!
§ At each iteration we need to re-evaluate marginal gains of all remaining documents
§ Runtime O(|D| · k) for selecting k documents out of the set D

[Figure: greedy adds the document with the highest marginal gain F(A ∪ {x}) − F(A) among candidates a, b, c, d, e]

slide-40
SLIDE 40

¡ In round i: So far we have A_{i-1} = {d_1, …, d_{i-1}}
§ Now we pick d_i = arg max_d F(A_{i-1} ∪ {d}) − F(A_{i-1})
§ Greedy algorithm maximizes the "marginal benefit"
Δ_i(d) = F(A_{i-1} ∪ {d}) − F(A_{i-1})
¡ By submodularity:
F(A_i ∪ {d}) − F(A_i) ≥ F(A_j ∪ {d}) − F(A_j) for i < j
¡ Observation: By submodularity, for every d ∈ D: Δ_i(d) ≥ Δ_j(d) for i < j, since A_i ⊆ A_j
¡ Marginal benefits Δ_i(d) only shrink! (as i grows)

[Leskovec et al., KDD '07]
Selecting document d in step i covers more words than selecting d at step j (j > i)

slide-41
SLIDE 41

¡ Idea:
§ Use Δ_i as an upper bound on Δ_j (j > i)
¡ Lazy Greedy:
§ Keep an ordered list of marginal benefits Δ_i from the previous iteration
§ Re-evaluate Δ_i only for the top element
§ Re-sort and prune

[Leskovec et al., KDD '07]
[Figure: priority queue of upper bounds Δ_1 on the marginal gains of a, b, c, d, e; A_1 = {a}]
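The lazy-greedy bookkeeping can be sketched with a heap of stale upper bounds (an illustrative sketch of the idea on the slide; the set names below are made up):

```python
import heapq

def lazy_greedy(sets, k):
    """Lazy greedy for max coverage.

    Cached marginal gains from earlier rounds are valid upper bounds
    (by submodularity), so we only re-evaluate the top of the heap.
    """
    covered, chosen = set(), []
    # Max-heap via negated gains; initial bounds are exact on the empty set
    heap = [(-len(s), d) for d, s in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_bound, d = heapq.heappop(heap)
        gain = len(sets[d] - covered)  # fresh marginal gain for the top element
        if not heap or gain >= -heap[0][0]:
            # Fresh gain still beats every other (upper-bounded) candidate
            chosen.append(d)
            covered |= sets[d]
        else:
            heapq.heappush(heap, (-gain, d))  # push back with the refreshed bound
    return chosen

docs = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {7}}
print(lazy_greedy(docs, 2))  # same picks as plain greedy: ['c', 'a']
```

Because bounds only shrink, a popped element whose fresh gain still dominates the next bound can be selected without re-evaluating anything else, which is where the speedup comes from.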

slide-42
SLIDE 42

¡ Idea:
§ Use Δ_i as an upper bound on Δ_j (j > i)
¡ Lazy Greedy:
§ Keep an ordered list of marginal benefits Δ_i from the previous iteration
§ Re-evaluate Δ_i only for the top element
§ Re-sort and prune

[Leskovec et al., KDD '07]
[Figure: queue re-ordered to a, d, b, c, e with upper bounds Δ_2; A_1 = {a}]

slide-43
SLIDE 43

¡ Idea:
§ Use Δ_i as an upper bound on Δ_j (j > i)
¡ Lazy Greedy:
§ Keep an ordered list of marginal benefits Δ_i from the previous iteration
§ Re-evaluate Δ_i only for the top element
§ Re-sort and prune

[Leskovec et al., KDD '07]
[Figure: queue re-ordered to a, c, d, b, e with upper bounds Δ_2; A_1 = {a}, A_2 = {a, b}]

slide-44
SLIDE 44

¡ Summary so far:
§ Diversity can be formulated as a set cover
§ Set cover is a submodular optimization problem
§ Can be (approximately) solved using the greedy algorithm
§ Lazy greedy gives a significant speedup

[Figure: running time (seconds) vs. number of blogs selected (1-10) for exhaustive search (all subsets), naive greedy, and lazy greedy; lower is better]

slide-45
SLIDE 45

But what about personalization?

[Figure: a user model mapping a user to personalized recommendations — "Election trouble", "Songs of Syria", "Sandy delays"]

slide-46
SLIDE 46

We assumed the same concept weighting for all users

[Figure: shared concept weights — France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL — covered by "France intervenes", "Chuck for Defense", "Argo wins big"]

slide-47
SLIDE 47

¡ Each user has different preferences over concepts

[Figure: two different weightings over the same concepts — a "politico" vs. a "movie buff"]

slide-48
SLIDE 48

¡ Assume each user u has a different preference vector w_c^(u) over concepts c
¡ Goal: Learn personal concept weights from user feedback

max_{A: |A| ≤ k} F(A) = Σ_c w_c · cover_A(c)   →   max_{A: |A| ≤ k} F(A) = Σ_c w_c^(u) · cover_A(c)

slide-49
SLIDE 49

[Figure: concepts — France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL — covered by "France intervenes", "Chuck for Defense", "Argo wins big"]

slide-50
SLIDE 50

¡ Multiplicative Weights algorithm
§ Assume each concept c has weight w_c
§ We recommend document d and receive feedback, say r = +1 or −1
§ Update the weights: for each c ∈ X_d set w_c = β^r · w_c
§ If concept c appears in doc d and we received positive feedback r = +1, we increase the weight w_c by multiplying it by β (β > 1); otherwise we decrease the weight (divide by β)
§ Normalize weights so that Σ_c w_c = 1
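The update on this slide can be sketched directly (a minimal illustration; β and the concept sets below are made-up values):

```python
def mw_update(weights, doc_concepts, r, beta=2.0):
    """Multiplicative-weights update after feedback r = +1 or -1 on a document.

    Every concept c covered by the document gets w_c <- beta**r * w_c,
    then all weights are renormalized to sum to 1.
    """
    w = dict(weights)
    for c in doc_concepts:
        w[c] *= beta ** r
    total = sum(w.values())
    return {c: v / total for c, v in w.items()}

# Start uniform; the user likes a story covering the concept "Argo"
w = {"Obama": 0.25, "Argo": 0.25, "NFL": 0.25, "Mali": 0.25}
w = mw_update(w, {"Argo"}, r=+1)
print(w["Argo"])  # grows from 0.25 to 0.4; the other weights shrink
```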

slide-51
SLIDE 51

¡ Steps of the algorithm:
• 1. Identify items to recommend
• 2. Identify concepts [what makes items redundant?]
• 3. Weigh concepts by general importance
• 4. Define item-concept coverage function
• 5. Select items using probabilistic set cover
• 6. Obtain feedback, update weights