INFO 4300 / CS4300 Information Retrieval

SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 7: Scores and Evaluation

Paul Ginsparg

Cornell University, Ithaca, NY

15 Sep 2011

1 / 42

SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email; use cs4300-l@lists.cs.cornell.edu
Course text at: http://informationretrieval.org/

Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also

Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

2 / 42

SLIDE 3

Discussion 2, 20 Sep

For this class, read and be prepared to discuss the following:

  • K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
  • Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf

The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)

3 / 42

SLIDE 4

Overview

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

4 / 42

SLIDE 5

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

5 / 42

SLIDE 6

Cluster pruning

Cluster docs in a preprocessing step
Pick √N “leaders”
For non-leaders, find the nearest leader (expect ≈ √N followers per leader)
For query q, find the closest leader L (√N computations)
Rank L and its followers

  • Or generalize: attach each follower to its b1 closest leaders, and then rank the clusters of the b2 leaders closest to the query (a sketch of the basic scheme follows below)
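A minimal sketch of the basic (b1 = b2 = 1) scheme, assuming unit-length document vectors so a dot product is cosine similarity; the helper names, the random leader selection, and the toy parameters are illustrative, not from the slides:

    import numpy as np

    def build_clusters(docs, seed=0):
        # Preprocessing: pick sqrt(N) random leaders; attach every doc to its nearest leader.
        rng = np.random.default_rng(seed)
        n = len(docs)
        leaders = rng.choice(n, size=max(1, int(np.sqrt(n))), replace=False)
        followers = {int(l): [] for l in leaders}
        for d in range(n):
            nearest = max(followers, key=lambda l: docs[d] @ docs[l])
            followers[nearest].append(d)
        return followers

    def cluster_prune_search(q, docs, followers, k=10):
        # Query time: score only the closest leader's cluster (roughly 2*sqrt(N) comparisons).
        best_leader = max(followers, key=lambda l: q @ docs[l])
        candidates = followers[best_leader]
        return sorted(candidates, key=lambda d: q @ docs[d], reverse=True)[:k]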

6 / 42

SLIDE 7

7 / 42

SLIDE 8

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

8 / 42

SLIDE 9

Non-docID ordering of postings lists

So far: postings lists have been ordered according to docID.
Alternative: a query-independent measure of “goodness” of a page.
Example: PageRank g(d) of page d, a measure of how many “good” pages hyperlink to d.
Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > . . .
Define composite score of a document: net-score(q, d) = g(d) + cos(q, d)
This scheme supports early termination: we do not have to process postings lists in their entirety to find the top k.

9 / 42

SLIDE 10

Non-docID ordering of postings lists (2)

Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > . . .
Define composite score of a document: net-score(q, d) = g(d) + cos(q, d)
Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we’re currently processing; (iii) the smallest top-k score we’ve found so far is 1.2.
Then all subsequent scores will be < 1.1, since cos(q, d) ≤ 1 and the remaining documents have even smaller g. So we’ve already found the top k and can stop processing the remainder of the postings lists.
Questions?
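A minimal sketch of this early-termination loop, assuming postings are already sorted by descending g(d) and cosine scores lie in [0, 1]; the function and argument names are illustrative:

    import heapq

    def topk_net_score(postings, g, cosine, q, k=10):
        # postings: doc ids sorted by descending static quality g(d); cosine(q, d) is in [0, 1].
        heap = []  # min-heap of (net_score, doc) for the best k seen so far
        for d in postings:
            # Remaining docs have g no larger than g(d), and cosine is at most 1, so if
            # g(d) + 1 cannot beat the current k-th best, nothing later can enter the top k.
            if len(heap) == k and g(d) + 1.0 <= heap[0][0]:
                break
            score = g(d) + cosine(q, d)
            if len(heap) < k:
                heapq.heappush(heap, (score, d))
            else:
                heapq.heappushpop(heap, (score, d))
        return sorted(heap, reverse=True)

The min-heap keeps the k best net-scores found so far, so the break condition compares the best possible remaining score, g(d) + 1, against the weakest score currently in the top k.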

10 / 42

SLIDE 11

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

11 / 42

SLIDE 12

Measures for a search engine

How fast does it index? (e.g., number of bytes per hour)
How fast does it search? (e.g., latency as a function of queries per second)
What is the cost per query? (in dollars)

12 / 42

SLIDE 13

Measures for a search engine

All of the preceding criteria are measurable: we can quantify speed / size / money. However, the key measure for a search engine is user happiness. What is user happiness? Factors include:

Speed of response
Size of index
Uncluttered UI
Most important: relevance (actually, maybe even more important: it’s free)

Note that none of these is sufficient: blindingly fast, but useless answers won’t make a user happy. How can we quantify user happiness?

13 / 42

SLIDE 14

Who is the user?

Who is the user we are trying to make happy?

Web search engine: searcher. Success: Searcher finds what she was looking for. Measure: rate of return to this search engine.
Web search engine: advertiser. Success: Searcher clicks on ad. Measure: clickthrough rate.
Ecommerce: buyer. Success: Buyer buys something. Measures: time to purchase, fraction of “conversions” of searchers to buyers.
Ecommerce: seller. Success: Seller sells something. Measure: profit per item sold.
Enterprise: CEO. Success: Employees are more productive (because of effective search). Measure: profit of the company.

14 / 42

SLIDE 15

Most common definition of user happiness: Relevance

User happiness is equated with the relevance of search results to the query. But how do you measure relevance? Standard methodology in information retrieval consists of three elements:

A benchmark document collection
A benchmark suite of queries
An assessment of the relevance of each query-document pair

15 / 42

SLIDE 16

Relevance: query vs. information need

Relevance to what? First take: relevance to the query. “Relevance to the query” is very problematic.
Information need i: “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.” This is an information need, not a query.
Query q: [red wine white wine heart attack]
Consider document d′: At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.
d′ is an excellent match for query q . . . but d′ is not relevant to the information need i.

16 / 42

SLIDE 17

Relevance: query vs. information need

User happiness can only be measured by relevance to an information need, not by relevance to queries. Terminology is sloppy here and in the course text: we say “query-document” relevance judgments even though we mean “information-need–document” relevance judgments.

17 / 42

SLIDE 18

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant:

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)

Recall (R) is the fraction of relevant documents that are retrieved:

Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

18 / 42

SLIDE 19

Precision and recall

                 Relevant               Nonrelevant
Retrieved        true positives (TP)    false positives (FP)
Not retrieved    false negatives (FN)   true negatives (TN)

P = TP / (TP + FP)
R = TP / (TP + FN)
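A quick sketch of these two formulas applied to sets of document ids (illustrative helper, not from the slides):

    def precision_recall(retrieved, relevant):
        # retrieved, relevant: sets of doc ids; returns (P, R) from the contingency counts.
        tp = len(retrieved & relevant)   # relevant items retrieved
        fp = len(retrieved - relevant)   # nonrelevant items retrieved
        fn = len(relevant - retrieved)   # relevant items missed
        precision = tp / (tp + fp) if retrieved else 0.0
        recall = tp / (tp + fn) if relevant else 0.0
        return precision, recall

    # Example: 3 of 4 retrieved docs are relevant; 6 relevant docs exist in total.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))   # (0.75, 0.5)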

19 / 42

SLIDE 20

Precision/recall tradeoff

You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall!
The converse is also true (usually): it’s easy to get high precision for very low recall. Suppose the document with the largest score is relevant. How can we maximize precision?

20 / 42

SLIDE 21

A combined measure: F

Frequently used: balanced F, the harmonic mean of P and R:

1/F = (1/2) (1/P + 1/R), or equivalently F = 2PR / (P + R)

Extremes: If P ≪ R, then F ≈ 2P. If R ≪ P, then F ≈ 2R. So F is automatically sensitive to whichever of the two is much smaller. If P ≈ R, then F ≈ P ≈ R.
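A tiny numeric illustration of how F tracks the smaller of the two values (numbers invented for illustration):

    def f_measure(p, r):
        # Balanced F: harmonic mean of precision and recall.
        return 2 * p * r / (p + r) if p + r > 0 else 0.0

    # With a lopsided system the arithmetic mean would look flattering,
    # but F stays close to twice the smaller of the two values.
    print(f_measure(0.04, 1.0))   # about 0.077, close to 2 * 0.04
    print(f_measure(0.5, 0.5))    # 0.5, equals P and R when they agree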

21 / 42

SLIDE 22

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

22 / 42

SLIDE 23

Precision-recall curve

Precision/recall/F are measures for unranked sets. We can easily turn set measures into measures of ranked lists. Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4, etc. results. Doing this for precision and recall gives you a precision-recall curve.
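A sketch of the prefix computation, assuming a ranked list of doc ids and a set of relevant ids (names and data are illustrative):

    def pr_curve(ranked, relevant):
        # Precision and recall after each prefix of the ranked list.
        points, hits = [], 0
        for k, doc in enumerate(ranked, start=1):
            hits += doc in relevant
            points.append((hits / len(relevant), hits / k))   # (recall@k, precision@k)
        return points

    # Relevant docs are {1, 3, 5}; the system ranks doc 1 first, then 2, then 3, ...
    print(pr_curve([1, 2, 3, 4, 5], {1, 3, 5}))
    # approximately [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]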

23 / 42

SLIDE 24

A precision-recall curve

[Figure: precision-recall curve, precision vs. recall, both axes 0.0-1.0]

Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, . . .).
Interpolation (in red): take the maximum of all future points.
Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.
Questions?
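A sketch of the interpolation step applied to prefix points like those above, taking the running maximum from the highest rank backwards:

    def interpolate(points):
        # points: list of (recall, precision) in rank order.
        # Interpolated precision at a point = max precision over that point and all later ones.
        best, out = 0.0, []
        for recall, precision in reversed(points):
            best = max(best, precision)
            out.append((recall, best))
        return list(reversed(out))

    print(interpolate([(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]))
    # [(0.33, 1.0), (0.33, 0.67), (0.67, 0.67), (0.67, 0.6), (1.0, 0.6)]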

24 / 42

SLIDE 25

11-point interpolated average precision

Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08

11-point average: ≈ 0.425
How can precision at 0.0 be > 0?
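As a quick check of the quoted average: (1.00 + 0.67 + 0.63 + 0.55 + 0.45 + 0.41 + 0.36 + 0.29 + 0.13 + 0.10 + 0.08) / 11 = 4.67 / 11 ≈ 0.425.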

25 / 42

SLIDE 26

Averaged 11-point precision/recall graph

[Figure: averaged 11-point precision/recall graph, precision vs. recall]

Compute interpolated precision at recall levels 0.0, 0.1, 0.2, . . .
Do this for each of the queries in the evaluation benchmark.
Average over queries.
This measure assesses performance at all recall levels.
The curve is typical of performance levels at TREC.
Note that performance is not very good!

26 / 42

SLIDE 27

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

27 / 42

SLIDE 28

Netflix challenge, 2006–2009

Next 9 slides adapted from (“Simon Funk” = Brandyn Webb) http://sifter.org/~simon/journal/20061211.html
See also popular article:

http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html

Netflix provided 100M ratings (from 1 to 5) of 17K movies by 500K users,
i.e., 100 million (User,Movie,Rating)’s of the form (105932,14002,3).
Predict (User,Movie,?) not in the database (how would the given User rate the given Movie?)
$50k incentive to the best each year, and $1M to the first to beat a set target (10% better than Netflix).

28 / 42

SLIDE 29

User-Movie Rating Matrix R_um

Visualize as a large sparse 500k × 17k “user-movie” matrix R_um, with the (u, m)th matrix element containing the rating (1–5) by user u for movie m. About 8.5B entries total, so data fills only 1 in 85 ≈ 1.2%.
Certain specified ‘?’ elements constitute a quiz: make a best guess P_um at the missing ratings.
Use “mean squared error” (mse) as the measure of accuracy: if the guess is 1.5 and the actual rating is 2, the “penalty” is (2 − 1.5)² = 0.25. Then sum the penalties over all guesses (with an optional square root for the rmse):

E = √( Σ_{u,m} (R_um − P_um)² )
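A sketch of this error computation with the ratings held as a sparse list of (user, movie, rating) triples (helper names are illustrative):

    import math

    def total_penalty(triples, predict, sqrt=True):
        # Sum of squared errors over the scored (user, movie, rating) triples; optional square root.
        total = sum((r - predict(u, m)) ** 2 for u, m, r in triples)
        return math.sqrt(total) if sqrt else total

    # A single guess of 1.5 against an actual rating of 2 contributes (2 - 1.5)**2 = 0.25.
    print(total_penalty([(105932, 14002, 2)], lambda u, m: 1.5, sqrt=False))   # 0.25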

29 / 42

SLIDE 30

Linear Dependencies

If one had the full 8.5 billion ratings (and many “weary users”), they would contain many regularities, i.e., not consist of 8.5B independent and unrelated ratings.

Describe each movie in terms of some basic attributes, such as
  • overall quality
  • action or comedy
  • actors
  • . . .

Describe user preferences in terms of complementary attributes or preferences:
  • they rate high or low
  • prefer action or comedy
  • preferred actors
  • . . .

30 / 42

SLIDE 31

Model the data

Explain 8.5 billion ratings with far fewer than 8.5 billion numbers (e.g., a single number specifying a movie’s action content can explain the attraction of a few million action-buffs).
Define a model for the data with a smaller number of parameters and infer the parameters from the data. SVD (= singular value decomposition) reduces in this case to the assumption that a user’s overall rating is composed of a sum of preferences over movie features.

31 / 42

SLIDE 32

Example: Just one Feature

Suppose only 1 feature, overall quality, and 1 corresponding user tendency to rate high/low.
Three users: U_u = (1, 2, 3)
Five movies: V_m = (1, 1, 3, 2, 1)
Predicted rating matrix:

P_um = U_u V_m =
  [ 1  1  3  2  1 ]
  [ 2  2  6  4  2 ]
  [ 3  3  9  6  3 ]

‘Explain’ 15 data points with only 7 parameters (only one overall scale).
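The one-feature prediction above is just an outer product; a quick check with the numbers from the slide:

    import numpy as np

    U = np.array([1, 2, 3])           # one "quality preference" value per user
    V = np.array([1, 1, 3, 2, 1])     # one "overall quality" value per movie
    P = np.outer(U, V)                # predicted 3 x 5 rating matrix

    print(P)
    # [[1 1 3 2 1]
    #  [2 2 6 4 2]
    #  [3 3 9 6 3]]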

32 / 42

SLIDE 33

More Features

Now suppose 40 features:
Each movie is described by 40 values, specifying for each feature the degree to which it is contained in the movie;
Each user is described by 40 values, specifying the degree to which each feature is preferred by the user.
To calculate a rating, sum the products of each user preference multiplied by the corresponding movie feature.
E.g., the movie Terminator might be (action=1.2, chickflick=-1, . . .), and user Joe might be (action=3, chickflick=-1, . . .). Combine to find Joe likes Terminator with rating 3 ∗ 1.2 + (−1) ∗ (−1) + . . . = 4.6 + . . .
(Negative numbers OK: Terminator is anti-chickflick, Joe has an aversion to chickflicks, “so Terminator actively scores positive points with Joe for being decidedly un-chickflicky.”)

33 / 42

SLIDE 34

Concise Model

Model requires roughly 40 ∗ (500K + 17K) values, or about 20M: less than the original 8.5B by a factor of 400. Predicted ratings:

P_um = Σ_{f=1}^{r} U^f_u · V^f_m

where U^f_u is the preference of user u for feature f, and V^f_m is the degree to which movie m contains feature f (up to r = 40).

The original matrix has been decomposed into the product of two rectangular matrices: the 500,000 × 40 user preference matrix U^f_u, and the 40 × 17,000 movie feature matrix V^f_m. (Matrix multiplication just performs the products and sums described above, resulting in an approximation to the original 500,000 × 17,000 rating matrix.)

34 / 42

SLIDE 35

P_um = U_u V_m:

  [ 1 ]                        [ 1  1  3  2  1 ]
  [ 2 ]  [ 1  1  3  2  1 ]  =  [ 2  2  6  4  2 ]
  [ 3 ]                        [ 3  3  9  6  3 ]
  (3×1)        (1×5)                 (3×5)

P_um = Σ_{f=1}^{r} U^f_u · V^f_m: in general, an (n×r) user-feature matrix multiplied by an (r×m) feature-movie matrix yields the full (n×m) matrix of predicted ratings.

35 / 42

SLIDE 36

How to calculate model parameters

Singular value decomposition (SVD) is the mathematical method for finding the two smaller matrices which minimize the resulting approximation error (rmse) to the original matrix. The rank-40 SVD of the 8.5B matrix gives the best approximation within the framework of the 40-feature user-movie-rating model.

  • Difficult to calculate the SVD of a large matrix.
  • Moreover, we don’t have all 8.5B entries

(instead we have 100M entries and 8.4B empty cells). But we can train the parameters by following the derivative of the approximation error (steepest descent). (This also means the unknown error on the 8.4B empty matrix elements can be ignored; for a fully known matrix, the end result coincides exactly with the SVD.)

36 / 42

SLIDE 37

Summary

End result of SVD = list of inferred categories, sorted by relevance. Each category is expressed by the extent to which each user and movie belong (or anti-belong), as read off from the columns of the user matrix U, or the rows of the movie matrix V.
Sorted by value, a category might represent action movies (movies with a lot of action at the top, slow movies at the bottom), and correspondingly users who like action movies (at the top, and those who prefer slow movies at the bottom).
The procedure discovers whatever the data implies: the algorithm itself has no inherent concept of action (it uses neither titles nor descriptions). It uses only a hundred million examples of the form: user 17538 gives movie 4819 a rating of 3 (and 84 of 85 ratings are missing).

37 / 42

SLIDE 38

Outline

1. Recap
2. Implementation
3. Unranked evaluation
4. Ranked evaluation
5. SVD Intuition
6. Incremental Numerics

38 / 42

SLIDE 39

Incremental SVD method

(from http://sifter.org/~simon/journal/20070815.html)

Recall:
R_um = known rating by user u for item m
P_um = predicted rating for user u for item m
Singular vectors indexed by f = 1, . . . , r
U^f_u = element of the f-th singular user vector for the u-th user
V^f_m = element of the f-th singular item vector for the m-th movie

SVD computes the prediction as:

P_um = Σ_{f=1}^{r} U^f_u · V^f_m

39 / 42

SLIDE 40

Error Gradient

The error in the prediction for user u’s rating of movie m is e_um = R_um − P_um, and the total rms error E for all predictions is given by

E² = Σ_{u′,m′} e²_{u′m′}

For gradient descent, take the partial derivative of the squared error with respect to each of the parameters U^f_u and V^f_m:

∂E²/∂U^f_u = Σ_{m′} (−2 e_{um′}) ∂P_{um′}/∂U^f_u = −2 Σ_{m′} e_{um′} V^f_{m′} = −2 Σ_{m′} (R_{um′} − P_{um′}) V^f_{m′}

(the derivative with respect to U^f_u is just a sum over all the ratings by user u). Similarly,

∂E²/∂V^f_m = Σ_{u′} (−2 e_{u′m}) ∂P_{u′m}/∂V^f_m = −2 Σ_{u′} e_{u′m} U^f_{u′} = −2 Σ_{u′} (R_{u′m} − P_{u′m}) U^f_{u′}
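A small numeric sanity check of the ∂E²/∂U^f_u formula on a toy one-feature example (all names and numbers are illustrative):

    import numpy as np

    R = np.array([[5.0, 3.0], [4.0, 1.0]])   # toy known ratings, 2 users x 2 movies
    U = np.array([1.0, 0.5])                  # one feature value per user
    V = np.array([2.0, 1.0])                  # one feature value per movie

    def sq_error(U, V):
        P = np.outer(U, V)
        return np.sum((R - P) ** 2)

    # Analytic gradient wrt U[0]: -2 * sum_m (R[0,m] - P[0,m]) * V[m]
    P = np.outer(U, V)
    analytic = -2 * np.sum((R[0] - P[0]) * V)

    # Finite-difference estimate of the same derivative
    h = 1e-6
    U_plus = U.copy()
    U_plus[0] += h
    numeric = (sq_error(U_plus, V) - sq_error(U, V)) / h

    print(analytic, numeric)   # both close to -16, agreeing to several decimal places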

40 / 42

SLIDE 41

Gradient Descent

http://mathworld.wolfram.com/MethodofSteepestDescent.html

Starts at point P_0 and moves from P_i to P_{i+1} by minimizing along the line extending from P_i in the direction of −∇f(P_i), the local downhill gradient.
For a 1-d function f(x), this takes the form of iterating x_i = x_{i−1} − ε f′(x_{i−1}) for small ε > 0, from a starting point x_0 until a fixed point is reached.
Example shown: f(x) = x^3 − 2x^2 + 2 with ε = 0.1 and starting points x_0 = 2, 0.01.
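A sketch of the 1-d iteration for this example; with ε = 0.1 both starting points eventually reach the local minimum at x = 4/3, the second only after slowly escaping the flat region near the local maximum at x = 0:

    def gradient_descent_1d(df, x0, eps=0.1, iters=200):
        # Iterate x_i = x_{i-1} - eps * f'(x_{i-1}).
        x = x0
        for _ in range(iters):
            x = x - eps * df(x)
        return x

    df = lambda x: 3 * x**2 - 4 * x      # derivative of f(x) = x^3 - 2x^2 + 2

    print(gradient_descent_1d(df, 2.0))   # about 1.3333, the local minimum at x = 4/3
    print(gradient_descent_1d(df, 0.01))  # also about 1.3333, after a slow start near x = 0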

41 / 42

SLIDE 42

Inner Loop

In the simple backpropagation-style gradient descent, the parameter step uses a “learning rate” ℓ = 2ε multiplied by the gradient:

ΔU^f_u = −ε ∂E²/∂U^f_u = ℓ Σ_{m′} e_{um′} V^f_{m′}

ΔV^f_m = −ε ∂E²/∂V^f_m = ℓ Σ_{u′} e_{u′m} U^f_{u′}

which translates to the inner loop of code as (with lrate standing for ℓ):

    real err = lrate * (rating(user,movie) - predictRating(user,movie));
    userValue[f][user] += err * movieValue[f][movie];
    movieValue[f][movie] += err * userValue[f][user];

(sum former over movies, latter over users, and iterate to minimum)
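A self-contained Python sketch of this training loop, sweeping one feature at a time over a small list of (user, movie, rating) triples; the update order mirrors the slide’s inner loop, while the array sizes, initial values, learning rate, and epoch counts are purely illustrative:

    import numpy as np

    def train(triples, n_users, n_movies, n_features=40, lrate=0.001, epochs=20):
        # For each feature in turn, repeatedly sweep the known ratings and
        # nudge the user/movie values along the error gradient.
        user_value = np.full((n_features, n_users), 0.1)
        movie_value = np.full((n_features, n_movies), 0.1)

        def predict(u, m):
            return float(np.dot(user_value[:, u], movie_value[:, m]))

        for f in range(n_features):
            for _ in range(epochs):
                for u, m, r in triples:
                    err = lrate * (r - predict(u, m))
                    user_value[f, u] += err * movie_value[f, m]   # same order as the slide's loop
                    movie_value[f, m] += err * user_value[f, u]
        return user_value, movie_value

    # Tiny illustrative data set of (user, movie, rating) triples
    ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2)]
    U, V = train(ratings, n_users=3, n_movies=3, n_features=2, lrate=0.01, epochs=200)
    print(np.dot(U[:, 0], V[:, 0]))   # prediction for user 0, movie 0 climbs toward the observed 5

Training the features one at a time, each against the error left over by the features already trained, is what makes the procedure “incremental”: there is never a need to materialize the full 500,000 × 17,000 matrix.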

42 / 42