CSCI 5417 Information Retrieval Systems Jim Martin
Lecture 7 9/13/2011
Today
- Review
- Efficient scoring schemes
- Approximate scoring
- Evaluating IR systems
Normal Cosine Scoring
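The algorithm figure for this slide didn't survive extraction. As a stand-in, here is a minimal Python sketch of term-at-a-time cosine scoring with heap-based top-K selection, the standard scheme this slide refers to; the postings layout (term to a list of (doc_id, tf) pairs), the doc_lengths table, and the log-tf/idf weighting are illustrative assumptions, not taken from the slides.

```python
import heapq
import math
from collections import defaultdict

def cosine_score(query_terms, postings, doc_lengths, N, k=10):
    """Term-at-a-time cosine scoring over an in-memory index.

    postings: dict mapping term -> list of (doc_id, tf) pairs
    doc_lengths: dict mapping doc_id -> document vector length
    N: number of documents in the collection
    """
    scores = defaultdict(float)
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = math.log(N / len(plist))                  # rarer terms weigh more
        for doc_id, tf in plist:
            scores[doc_id] += (1 + math.log(tf)) * idf  # log-tf times idf
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]           # cosine length normalization
    # A heap finds the top k without sorting every scored document
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

The heap is the point of the "efficient scoring" discussion: selecting the top K of n scored docs costs O(n log K) rather than the O(n log n) of a full sort.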
Generic approach to approximate top-K
- Find a set A of contender docs, with K < |A| << N
- A does not necessarily contain the top K, but contains many docs from among the top K
- Return the top K docs in A
- Docs with low scores are unlikely to change the membership of the top K
- Two ideas follow
Idea 1: truncate postings traversal
- Stop traversing a term's postings after a fixed number of docs, or once wf(t,d) drops below some threshold
- Take the union of the surviving docs from the postings of each query term and score only those
Idea 2: consider query terms in decreasing order of idf
- High-idf terms are likely to contribute most to the score, so process them first
- Stop if doc scores are relatively unchanged as the remaining lower-idf terms are processed (see the sketch below)
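A hedged sketch of how these two ideas can combine to build the contender set A; the cutoff values, and the assumption that each postings list is sorted by descending weight, are mine for illustration.

```python
import math

def contender_set(query_terms, postings, N, idf_cutoff=1.5, prefix=100):
    """Build the contender set A for approximate top-K retrieval.

    Idea 2: skip low-idf terms entirely (they contribute little).
    Idea 1: for the remaining terms, traverse only a fixed-length
    prefix of each postings list (assumed sorted by weight), then
    take the union. Exact cosine scores are computed only for A.
    """
    A = set()
    # Process terms in decreasing idf order: high-idf terms first
    by_idf = sorted(query_terms,
                    key=lambda t: -math.log(N / max(len(postings.get(t, [])), 1)))
    for term in by_idf:
        plist = postings.get(term, [])
        if not plist:
            continue
        if math.log(N / len(plist)) < idf_cutoff:
            break                      # every remaining term has even lower idf
        A.update(doc_id for doc_id, _ in plist[:prefix])
    return A
```

Sorting each postings list by weight is what makes the fixed prefix a reasonable stand-in for "the postings with the highest wf(t,d)".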
Evaluating IR Systems

Measures for a search engine
- How fast does it index: number of documents/hour; support for realtime search
- How fast does it search: latency as a function of index size
- Expressiveness of the query language: ability to express complex information needs, and speed on complex queries
The key measure: user happiness
- Speed of response and size of index are factors
- But blindingly fast, useless answers won't make a user happy
- What makes people come back?
Measuring user happiness
- First issue: who is the user we are trying to make happy?
- Web search engine: users find what they want and come back; can measure the rate of returning users
- eCommerce site: users find what they want and make a purchase; measure time to purchase, or the fraction of searchers who become buyers
- Enterprise (company/government/academic): the concern is user productivity; how much time do my users save when looking for information?
- Many other criteria come into play, having to do with breadth of access, secure access, and so on
Measuring relevance
Happiness is elusive to measure directly, so the most common proxy is the relevance of search results. Measuring relevance requires three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of relevance for each query-document pair
Relevance and information needs
- Relevance is assessed relative to an information need, not a query
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risk than white wine
- Query: wine red white heart attack effective
- A doc is relevant if it addresses the information need, not just because it contains the query words
Standard relevance benchmarks
- Retrieval tasks are specified, sometimes as queries
- Human experts mark docs as relevant or nonrelevant for each query, or at least for the subset of docs that some system returned for that query
- With binary judgments, the possible errors are false positives/false negatives, i.e. Type 1/Type 2 errors
Why not accuracy?
- Given a query, an engine classifies each doc as relevant or not relevant
- Accuracy of an engine: the fraction of these classifications that are correct
- This is not useful in IR: almost all docs are nonrelevant, so an engine that returns nothing scores near-perfect accuracy
Precision and recall
- Precision: fraction of retrieved docs that are relevant
- Recall: fraction of relevant docs that are retrieved

                Relevant   Not relevant
  Retrieved        a            b
  Not retrieved    c            d

Precision P = a/(a+b)
Recall R = a/(a+c)
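A quick worked example with made-up numbers: suppose a query retrieves 10 docs, 4 of them relevant, and the collection holds 9 relevant docs in total. Then a = 4, b = 6, c = 5, giving P = 4/10 = 0.4 and R = 4/9 ≈ 0.44.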
The precision/recall trade-off
- You can get perfect recall (but low precision) by retrieving every doc for every query
- Recall is a non-decreasing function of the number of docs retrieved: recall either stays the same or increases as more docs are retrieved
- In a good system, precision decreases as the number of docs retrieved grows, or equivalently as recall increases
- This is not a theorem, but a fact with strong empirical confirmation
Difficulties in using precision/recall
- Need human relevance assessments, and people aren't really reliable assessors
- Results are heavily skewed by collection and authorship: systems tuned on one collection may not transfer to another
Evaluating ranked results
- We're not doing Boolean relevant/not-relevant retrieval; the system returns a ranking
- The system can return a varying number of results
- All things being equal, we want relevant docs to appear higher in the ranking (see the sketch below)
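One simple way to score a ranking against binary judgments is precision at k. A minimal sketch, assuming results are a list of doc ids and judgments are a set:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked doc ids that are in the relevant set."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

# e.g. precision_at_k(["d3", "d7", "d1", "d9"], {"d3", "d1"}, k=3) -> 2/3
```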
[Figure: precision-recall curve for a single query; recall on the x-axis (0.0 to 1.0), precision on the y-axis (0.0 to 1.0)]
Averaging over queries
- A precision-recall graph for a single query isn't a very sensible thing to look at
- You need to average performance over a whole set of queries
- But there's a technical issue: precision-recall calculations fill in only some points on the graph
- How do you determine a value (interpolate) between those points?
11-point interpolated precision
- Take the interpolated precision at the 11 recall levels 0, .1, .2, ..., 1
- Interpolated precision at recall level r is the maximum precision observed at any recall level >= r
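A minimal sketch of both steps, assuming binary judgments, a ranked list of doc ids, and helper names of my own choosing:

```python
def pr_points(ranking, relevant):
    """(recall, precision) pairs at each rank where a relevant doc appears.
    Assumes `relevant` is a non-empty set of doc ids."""
    points, hits = [], 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

def interpolated_11pt(points):
    """Interpolated precision at recall r is the max precision at any
    recall >= r, evaluated at r = 0.0, 0.1, ..., 1.0."""
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]
```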
[Figure: 11-point interpolated precision-recall curve; recall on the x-axis, precision on the y-axis]
[Figure: SabIR/Cornell 8A1 11-point precision from TREC 8; recall on the x-axis, precision on the y-axis]
Summary measures
- Graphs are good, but people like single summary numbers
- Precision at fixed retrieval level (precision at k):
  - Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages
  - But it has an arbitrary parameter k
- 11-point interpolated average precision:
  - The standard measure in the TREC competitions: take the interpolated precision at the 11 recall levels from 0 to 1 by tenths and average them
  - Evaluates performance at all recall levels
Mean average precision (MAP)
- Average of the precision values obtained each time a relevant doc is retrieved in the ranking
- Avoids interpolation and the use of fixed recall levels
- MAP for a query collection is the arithmetic average over queries
- This is macro-averaging: each query counts equally
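A minimal sketch of MAP under binary judgments; the (ranking, relevant_set) representation per query is an assumption:

```python
def average_precision(ranking, relevant):
    """Mean of the precision values observed each time a relevant doc
    is retrieved; relevant docs never retrieved contribute zero."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Macro-average of AP over queries: each query counts equally.
    runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```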
What can we compare?
- System A vs. System B
- System A (version 1.1) vs. System A (version 1.2)
- Approach A vs. Approach B: e.g., the vector space approach vs. a probabilistic approach
- Systems on different collections? System A on medical text vs. TREC vs. web text?
Creating test collections
Still need:
- Test queries
  - Must be germane to the docs available
  - Best designed by domain experts
  - Random query terms are generally not a good idea
- Relevance assessments
  - Human judges are time-consuming
  - Human panels are not perfect
Pooling
- Measuring true recall would mean looking at every document for every query
- Instead: run each query on a representative set of state-of-the-art systems
- Take the union of the top N results from each system
- Have the analysts judge relevance only for the docs in that pool (sketched below)
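A minimal sketch of pool construction; the list-of-rankings input and the cutoff n are illustrative:

```python
def build_pool(system_rankings, n=100):
    """Union of the top-n results from each participating system.
    Only docs in this pool go to the human assessors."""
    pool = set()
    for ranking in system_rankings:
        pool.update(ranking[:n])
    return pool
```

Docs outside the pool are conventionally treated as nonrelevant, which is why the participating systems need to be representative of the state of the art.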
Relevance vs. marginal relevance
- A document can be redundant even if it is highly relevant:
  - duplicates
  - the same information from different sources
- Marginal relevance, how much new relevant information a doc adds given the docs ranked above it, is a better measure of utility for the user
- Using facts/entities as evaluation units measures this more directly
- But it is harder to create such an evaluation set
Other evaluation measures
- NDCG (Normalized Discounted Cumulative Gain): handles graded, non-binary relevance judgments (sketched below)
- Clickthrough on the first result: not very reliable if you look at a single clickthrough, but informative when aggregated over many users
- Studies of user behavior in the lab
- A/B testing
- Focus groups
- Diary studies
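A minimal sketch of NDCG, assuming the graded judgments are already aligned with the ranking order; the log2 discount is the common choice, though variants exist:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: graded relevance, discounted by
    the log2 of the rank so early positions matter most."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains):
    """DCG of the actual ranking, normalized by the DCG of the ideal
    (best possible) ordering of the same judgments."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# e.g. ndcg([3, 2, 0, 1]) compares the ranking against the ideal [3, 2, 1, 0]
```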