Why Batch and User Evaluations Do Not Give the Same Results
- A. Turpin, Curtin University of Technology, Perth, Australia
- W. Hersh, Oregon Health Sciences University, Portland, Oregon

Presented at SIGIR 2001, New Orleans
[Figure: MAP of TREC ad-hoc runs]
Number: 414i
Title: Cuba, sugar, imports
Description: What countries import Cuban sugar?
Instances: In the time allotted, please find as many DIFFERENT countries of the sort described above as you can. Please save at least one document for EACH such DIFFERENT country. If one document discusses several such countries, you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT countries of the sort described above as possible.
Experiment 1 - Instance Recall
|          | Pre-batch MAP | User instance recall | Post-batch MAP |
|----------|---------------|----------------------|----------------|
| Baseline | 0.213         | 0.330                | 0.275          |
| Improved | 0.385         | 0.390                | 0.324          |
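To make the two measures in this table concrete, here is a minimal sketch of how mean average precision (the batch measure) and instance recall (the user measure) can be computed. The function names and data layouts are illustrative assumptions, not the authors' code.

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one query: mean of the precision values
    at each rank where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(docs, rel) for docs, rel in runs) / len(runs)

def instance_recall(saved_docs, instances_of, all_instances):
    """Fraction of the distinct instances (e.g. countries importing Cuban
    sugar) covered by the documents a user saved."""
    found = {i for d in saved_docs for i in instances_of.get(d, set())}
    return len(found) / len(all_instances)
```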
1) What are the names of three US national parks where one can find redwoods?
2) Identify a site with Roman ruins in present-day France.
3) Name four films in which Orson Welles appeared.
4) Name three countries that imported Cuban sugar during the period of time covered by the document collection.
5) Which children's TV program was on the air longer: the original Mickey Mouse Club or the original Howdy Doody Show?
6) Which painting did Edvard Munch complete first: Vampire or Puberty?
7) Which was the last dynasty of China: Qing or Ming?
8) Is Denmark larger or smaller in population than Norway?
Experiment 2 - Question Answering
|          | Pre-batch MAP | User QA accuracy | Post-batch MAP |
|----------|---------------|------------------|----------------|
| Baseline | 0.228         | 66%              | 0.270          |
| Improved | 0.327         | 60%              | 0.354          |
[Figure: Precision metrics (MAP, p@10, p@50) on user queries and the collection, Baseline vs Improved. Relative improvements of Improved over Baseline, IR experiment: MAP +47% (p=0.02), p@10 +57% (p=0.03), p@50 +33% (p=0.14); QA experiment: MAP +68% (p=0.001), p@10 +100% (p=0.001), p@50 +40% (p=0.02)]
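The percentage annotations in the figure above are relative improvements of the Improved system over the Baseline, each with a significance test over per-query scores. A minimal sketch of that style of comparison follows; the slide does not say which test was used, so the Wilcoxon signed-rank test here is an assumption (a paired t-test is another common choice).

```python
from scipy.stats import wilcoxon

def compare_systems(baseline_scores, improved_scores):
    """Relative improvement and p-value for paired per-query scores
    (e.g. per-topic MAP, p@10, or p@50 under two weighting schemes)."""
    base_mean = sum(baseline_scores) / len(baseline_scores)
    impr_mean = sum(improved_scores) / len(improved_scores)
    rel_improvement = (impr_mean - base_mean) / base_mean
    _, p_value = wilcoxon(baseline_scores, improved_scores)
    return rel_improvement, p_value
```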
[Figure: Number of instances found on user queries and the collection, Baseline vs Improved. Relative improvements: +105% (p=0.04) and +30% (p=0.28)]
Possible explanations:
- Maybe obscure document titles?
- Users don't read the list from the top?
[Figure: Number of relevant and irrelevant documents viewed, Baseline vs Improved. Differences: QA relevant +2% (p=0.93), QA irrelevant 0% (p=0.97); IR relevant +35% (p=0.01), IR irrelevant +35% (p=0.02)]
Users may not need a good weighting scheme because:
- They will ignore high-ranking relevant docs
- They will happily issue a few extra queries
The three weighting schemes compared:

Basic cosine:
$$\mathrm{sim}(q,d) = \frac{\sum_{t \in T_q \cap T_d} TF_{t,d}\,IDF_t}{\sqrt{\sum_{t \in T_d} \left(TF_{t,d}\,IDF_t\right)^2}}$$

Okapi:
$$\mathrm{sim}(q,d) = \sum_{t \in T_q \cap T_d} w_t\,\frac{(k_1+1)\,TF_{t,d}}{K_d + TF_{t,d}}\cdot\frac{(k_3+1)\,TF_{t,q}}{k_3 + TF_{t,q}}$$

Pivoted:
$$\mathrm{sim}(q,d) = \sum_{t \in T_q \cap T_d} \frac{1+\ln\bigl(1+\ln TF_{t,d}\bigr)}{(1-s)+s\,\frac{dl_d}{avdl}}\cdot TF_{t,q}\cdot\ln\frac{N+1}{df_t}$$

where $TF_{t,d}$ is the frequency of term $t$ in document $d$, $T_q$ and $T_d$ are the term sets of the query and document, $IDF_t$ and $w_t$ are inverse document frequency weights, $K_d = k_1\bigl((1-b)+b\,dl_d/avdl\bigr)$, $dl_d$ is the length of $d$, $avdl$ is the average document length, and $k_1$, $b$, $k_3$, $s$ are tuning constants.
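For concreteness, a hedged sketch of the three schemes in conventional textbook form. The parameter defaults (k1=1.2, b=0.75, k3=1000, s=0.2) and the exact normalizations are assumptions, not necessarily the constants used in these experiments.

```python
import math

def idf(N, df_t):
    """Inverse document frequency for a term appearing in df_t of N docs."""
    return math.log(N / df_t)

def cosine_score(query_tf, doc_tf, N, df):
    """Basic cosine: TF*IDF dot product, normalized by document vector length."""
    norm = math.sqrt(sum((tf * idf(N, df[t])) ** 2 for t, tf in doc_tf.items()))
    dot = sum(q_tf * idf(N, df[t]) * doc_tf[t] * idf(N, df[t])
              for t, q_tf in query_tf.items() if t in doc_tf)
    return dot / norm if norm else 0.0

def okapi_score(query_tf, doc_tf, N, df, dl, avdl, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25: saturated term frequency with document length normalization."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, q_tf in query_tf.items():
        if t not in doc_tf:
            continue
        w = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        score += (w * (k1 + 1) * doc_tf[t] / (K + doc_tf[t])
                    * (k3 + 1) * q_tf / (k3 + q_tf))
    return score

def pivoted_score(query_tf, doc_tf, N, df, dl, avdl, s=0.2):
    """Pivoted normalization: double-log TF over a pivoted length norm."""
    norm = (1 - s) + s * dl / avdl
    return sum((1 + math.log(1 + math.log(doc_tf[t]))) / norm
               * q_tf * math.log((N + 1) / df[t])
               for t, q_tf in query_tf.items() if t in doc_tf)
```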