SLIDE 1

Why Batch and User Evaluations Do Not Give the Same Results

  • A. Turpin

Curtin University of Technology Perth, Australia

  • W. Hersh

Oregon Health Sciences University Portland, Oregon

Presented at SIGIR2001 New Orleans

SLIDE 2

TREC ad-hoc

[Chart: MAP of TREC ad-hoc runs by year, 1995–2000; y-axis roughly 0.2–0.4]
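MAP (mean average precision) is the batch effectiveness measure referred to throughout this talk. As a rough illustration of how it is computed, here is a minimal sketch over made-up ranked lists and relevance judgments; none of the document identifiers below come from the talk.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list: the mean of precision@k
    taken at each rank k where a relevant document appears."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: the average of per-topic average precision."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical example: two topics, each with a ranked list and its relevant set.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),   # AP = (1/2 + 2/4) / 2 = 0.5
    (["d5", "d9", "d4"],       {"d5"}),         # AP = 1/1 = 1.0
]
print(mean_average_precision(runs))             # 0.75
```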

SLIDE 3

Experimental method

  • 1. Set baseline system to basic Cosine Vector weights
  • 2. Identify “super” system using batch experiments
  • 3. Run 24 users on the 2 systems with the same topics
  • 4. Send results off to NIST
  • 5. Get relevance judgments
  • 6. Analyse user results
  • 7. Check batch results
SLIDE 4

Example instance recall query

Number: 414i
Title: Cuba, sugar, imports
Description: What countries import Cuban sugar?
Instances: In the time allotted, please find as many DIFFERENT countries of the sort described above as you can. Please save at least one document for EACH such DIFFERENT country. If one document discusses several such countries, then you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT countries of the sort described above as possible.
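Instance recall, the user measure in Experiment 1, is the fraction of the known distinct instances (here, countries importing Cuban sugar) that the documents a user saves actually cover. A minimal sketch with invented judgments; the document identifiers and country sets are illustrative only.

```python
def instance_recall(saved_docs, instances_per_doc, all_instances):
    """Fraction of the known distinct instances covered by the saved documents."""
    found = set()
    for doc in saved_docs:
        found |= instances_per_doc.get(doc, set())
    return len(found & all_instances) / len(all_instances)

# Hypothetical judgments: which countries each judged document mentions.
instances_per_doc = {
    "FT-101": {"Russia", "China"},
    "FT-202": {"China"},
    "FT-303": {"Canada"},
}
all_instances = {"Russia", "China", "Canada", "Spain"}  # 4 known importer countries

print(instance_recall(["FT-101", "FT-202"], instances_per_doc, all_instances))  # 0.5
```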

SLIDE 5
SLIDE 6

Experiment 1 - Instance Recall

             Pre-batch MAP   User IR   Post-batch MAP
  Baseline       0.213        0.330        0.275
  Improved       0.385        0.390        0.324

SLIDE 7

8 Q&A queries

1) What are the names of three US national parks where one can find redwoods?
2) Identify a site with Roman ruins in present day France
3) Name four films in which Orson Welles appeared
4) Name three countries that imported Cuban sugar during the period of time covered by the document collection

SLIDE 8

8 Q&A queries

5) Which children's TV program was on the air longer, the original Mickey Mouse Club or the original Howdy Doody Show?
6) Which painting did Edvard Munch complete first, Vampire or Puberty?
7) Which was the last dynasty of China, Qing or Ming?
8) Is Denmark larger or smaller in population than Norway?

SLIDE 9

Experiment 2 - Question Answering

             Pre-batch MAP   User QA   Post-batch MAP
  Baseline       0.228         66%         0.270
  Improved       0.327         60%         0.354

SLIDE 10

Results Summary

                       Predicted      Actual
  Instance recall         81%      15% (p = 0.27)
  Question answering      58%      -6% (p = 0.41)

Why?

  • 1. Systems no different on topics and collection used
  • 2. There was a difference, but users ignored it
SLIDE 11

Precision metrics on user queries and collection

Improvement of the Improved system over Baseline (bar chart, 0.0–0.6 scale):

                            MAP              P@10              P@50
  Inst. recall experiment   47% (p=0.02)     57% (p=0.03)      33% (p=0.14)
  QA experiment             68% (p=0.001)    100% (p=0.001)    40% (p=0.02)
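P@10 and P@50 above are precision values at fixed rank cutoffs, computed on the users' own queries against the relevance judgments. A minimal sketch of precision at a cutoff k, again over made-up data:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical user query result list and judged-relevant set.
ranking = ["d4", "d1", "d9", "d2", "d6", "d8", "d3", "d5", "d7", "d0"]
relevant = {"d1", "d2", "d3"}

print(precision_at_k(ranking, relevant, 10))  # 0.3
```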

SLIDE 12

Number of instances on user queries and collection

[Bar chart: mean number of instances retrieved (0–14), Baseline vs Improved]

  • Num. inst. @10: +105% (p=0.04)
  • Num. inst. @50: +30% (p=0.28)

SLIDE 13

So what happens to the difference?

  • Users compensate for the lack of relevant docs within the time limit
  • Users ignore high-ranked relevant documents
    – Maybe obscure document titles?
    – Don’t read the list from the top?
  • “Extra” relevant docs give no new information
SLIDE 14

Number of queries per topic

[Bar chart: queries issued per topic (1–5), Baseline vs Improved]

  IR experiment: 16% difference (p=0.16)
  QA experiment: 33% difference (p=0.04)

SLIDE 15

Number of docs retrieved

[Bar chart: relevant and irrelevant documents retrieved (roughly 50–150), Baseline vs Improved, for the IR and QA experiments; differences of 2% (p=0.93), 0% (p=0.97), 35% (p=0.01), and 35% (p=0.02)]

SLIDE 16

Number of top-10 relevant docs ignored

[Bar chart: proportion of top-10 relevant documents ignored (0–60%), Baseline vs Improved; difference of 24% (p=0.22) for IR and 87% (p=0.002) for QA]

SLIDE 17

Conclusion

  • In these two tasks there is no use providing users with a good weighting scheme, because
    – They will ignore high-ranking relevant docs
    – They will happily issue a few extra queries
  • They find answers just as well with old technology
  • User interface effects?
  • Task effect?
SLIDE 18
[Formulas for the three term-weighting schemes compared: Basic cosine, Okapi, Pivoted Okapi]
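The formula text on this slide did not survive extraction cleanly. As a rough guide, standard forms of these three weighting functions look something like the following, where $f_{t,d}$ is the frequency of term $t$ in document $d$, $f_{t,q}$ its frequency in the query, $f_t$ the number of documents containing $t$, $N$ the number of documents, $W_d$ the document length (or weight), and $W'$ the average document length; the exact variants and parameter choices on the original slide may differ.

```latex
% Basic cosine: TF-IDF with cosine document-length normalisation
\mathrm{sim}(q,d) =
  \frac{\sum_{t \in T_q \cap T_d} \mathit{TF}(t,d)\,\mathit{IDF}(t)}
       {\sqrt{\sum_{t \in T_d} \mathit{TF}(t,d)^2}}

% Okapi (BM25-style): within-document frequency saturated by a
% length-dependent constant K_d
\mathrm{sim}(q,d) =
  \sum_{t \in T_q \cap T_d} \mathit{IDF}(t)\,\frac{f_{t,d}}{f_{t,d} + K_d}

% Pivoted Okapi: document length pivoted against the collection average W'
\mathrm{sim}(q,d) =
  \sum_{t \in T_q \cap T_d} f_{t,q}\,\ln\frac{N}{f_t}\,
  \frac{f_{t,d}}{f_{t,d} + W_d / W'}
```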