SLIDE 1

Building Reusable Test Collections

Ellen M. Voorhees

SLIDE 2

Test Collections

  • Evaluate search effectiveness using test collections
  • set of documents
  • set of questions
  • relevance judgments
  • Relevance judgments
  • ideally, complete judgments: all documents judged for all topics
  • infeasible for document sets large enough to be interesting
  • so, need to sample, but how?

[Figure: example relevance judgments showing how Number Relevant Retrieved, Number Retrieved, and R (total relevant) feed into evaluation measures]

SLIDE 3

Problem Statement

Want to build general-purpose, reusable IR test collections at acceptable cost

  • General-purpose: supports a wide range of measures and search scenarios
  • Reusable: unbiased for systems that were not used to build the collection
  • Cost: proportional to the number of human judgments required for the entire procedure

SLIDE 4

Pooling

  • For sufficiently large l and diverse engines, depth-l pools produce “essentially complete” judgments
  • Unjudged documents are assumed to be not relevant when computing traditional evaluation measures such as average precision (AP)
  • Resulting test collections have been found to be both fair and reusable
  • fair: no bias against systems used to construct collection
  • reusable: fair to systems not used in collection construction

[Diagram: the top l documents from RUN A and RUN B for topics 401–403 are merged into per-topic pools of alphabetized docnos]
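
A minimal sketch of depth-l pool construction, assuming each run is a per-topic ranked list of docnos; the function name and data layout are illustrative, not from the slides:

```python
# Hedged sketch: the judging pool for one topic is the union of the top-l
# documents from every run, ordered (e.g., alphabetically) so assessors see
# no rank information.
def build_pool(runs, l):
    """runs: dict mapping run_id -> list of docnos ranked best-first."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:l])      # top-l documents from each run
    return sorted(pool)               # alphabetized docnos, as in the diagram

# Example for topic 401 with depth l = 2
runs_401 = {"RUN_A": ["doc7", "doc3", "doc9"],
            "RUN_B": ["doc3", "doc5", "doc1"]}
print(build_pool(runs_401, l=2))      # ['doc3', 'doc5', 'doc7']
```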

SLIDE 5

Pooling Bias

  • Traditional pooling takes the top l documents
  • intentional bias toward top ranks, where relevant documents are found
  • l was originally large enough to reach past the swell of topic-word relevant documents
  • As the document collection grows, a constant cut-off stays within the swell
  • Pools cannot be proportional to corpus size due to practical constraints
  • sample runs differently to build unbiased pools
  • new evaluation metrics that do not assume complete judgments


  • C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Information Retrieval, 10(6):491–508, 2007.

SLIDE 6

LOU Test

“Leave Out Uniques” test of reusability: examine the effect on the test collection if some participating team had not participated

Procedure

  • create a judgment set that removes all relevant documents uniquely retrieved by one team
  • evaluate all runs using the original judgment set and again using the newly created set
  • compare evaluation results
  • Kendall’s τ between system rankings
  • maximum drop in ranking over the runs submitted by the team
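
A hedged sketch of the LOU procedure; the scoring helper (score_runs), the qrels layout ({(topic, doc): relevance}), and team_unique_rel are assumptions for illustration, not code from the slides:

```python
# Hedged sketch of the Leave-Out-Uniques reusability test.
from scipy.stats import kendalltau

def lou_test(score_runs, runs, qrels, team_unique_rel, team_runs):
    """team_unique_rel: relevant (topic, doc) pairs retrieved only by the held-out team.
    Returns Kendall's tau between the two system rankings and the team's max rank drop."""
    reduced_qrels = {k: v for k, v in qrels.items() if k not in team_unique_rel}
    full = score_runs(runs, qrels)              # {run_id: score} on original judgments
    reduced = score_runs(runs, reduced_qrels)   # {run_id: score} on reduced judgments
    order = sorted(full, key=full.get, reverse=True)
    tau, _ = kendalltau([full[r] for r in order], [reduced[r] for r in order])
    rank_full = {r: i for i, r in enumerate(order)}
    rank_reduced = {r: i for i, r in
                    enumerate(sorted(reduced, key=reduced.get, reverse=True))}
    # positive value = the run fell that many places when the uniques were removed
    max_drop = max(rank_reduced[r] - rank_full[r] for r in team_runs)
    return tau, max_drop
```
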
SLIDE 7

Inferred Measure Sampling

  • Stratified sampling where strata are defined by ranks
  • Different strata have different probabilities for documents to be selected to be judged
  • Given strata and probabilities, estimate AP by inferring which unjudged docs are likely to be relevant
  • Quality of estimate varies widely depending on exact sampling strategy
  • Fair, but may be less reusable
  • E. Yilmaz, E. Kanoulas, and J. A. Aslam. A simple and efficient sampling method for estimating AP and NDCG. SIGIR 2008, pp. 603–610.
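
A hedged sketch of the sampling idea only (the full inferred/statistical AP estimators in the cited paper involve more machinery); strata boundaries, probabilities, and function names are illustrative assumptions:

```python
# Rank-stratified sampling plus a Horvitz-Thompson style estimate of R.
import random

def sample_for_judging(ranked_pool, strata):
    """strata: list of (depth, probability) pairs with increasing depths,
    e.g. [(20, 1.0), (100, 0.3), (1000, 0.05)].
    Returns (docno, inclusion_probability) pairs selected for judging."""
    sampled, start = [], 0
    for depth, prob in strata:
        for doc in ranked_pool[start:depth]:
            if random.random() < prob:
                sampled.append((doc, prob))
        start = depth
    return sampled

def estimate_num_relevant(sampled, judge):
    """judge(doc) -> True if relevant; weight each sampled relevant doc by 1/p."""
    return sum(1.0 / p for doc, p in sampled if judge(doc))
```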

SLIDE 8

Multi-armed Bandit Sampling

  • Bandit techniques trade off between exploiting known good “arms” and exploring to find better arms. For collection building, each run is an arm, and the reward is finding a relevant document
  • Simulations suggest bandit methods can produce similar-quality collections as pooling but with many fewer judgments
  • TREC 2017 Common Core track was the first attempt to build a new collection using a bandit technique
  • bandit selection method: 2017: MaxMean; 2018: MTF

  • D. Losada, J. Parapar, and A. Barreiro. Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation. Proceedings of SAC 2016, pp. 1027–1034.

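A simplified greedy sketch in the spirit of run-as-arm bandit judging (not the exact MaxMean or MTF algorithms from the cited paper); the smoothing, data layout, and names are assumptions:

```python
# Each run is an "arm"; pulling an arm judges that run's next unjudged document
# for one topic, and the reward is finding a relevant document.
def bandit_judging(runs, judge, budget):
    """runs: {run_id: ranked docnos for one topic}; judge(doc) -> bool; returns qrels."""
    pos = {r: 0 for r in runs}        # next unjudged rank in each run
    rel = {r: 0 for r in runs}        # relevant documents credited to each run
    pulls = {r: 0 for r in runs}      # judgments charged to each run
    qrels = {}
    for _ in range(budget):
        candidates = [r for r in runs if pos[r] < len(runs[r])]
        if not candidates:
            break
        # smoothed mean reward (Beta(1,1) prior) so unexplored runs still get pulled
        best = max(candidates, key=lambda r: (rel[r] + 1) / (pulls[r] + 2))
        doc = runs[best][pos[best]]
        pos[best] += 1
        pulls[best] += 1
        if doc not in qrels:          # judge each document at most once
            qrels[doc] = judge(doc)
        if qrels[doc]:
            rel[best] += 1
    return qrels
```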

SLIDE 9

Implementing a practical bandit approach

How should the overall budget be divided among topics?

  • use features of top-10 pools (PoolSize, NumRel, NumNonRel) to predict the per-topic minimum number of judgments needed
  • results in a conservative, but reasonable, allocation of budget across topics for historical collections

[Figure: Budget Allocation Strategy: combined budget for all topics (5,000–35,000 judgments) as a function of pool depth, for different feature and exponent choices]

How does the assessor learn the topic?

  • allocating some budget to shallow pools causes minimal degradation over the “pure” bandit method

Per-topic estimate: $\text{Estimate}_t = \text{PoolSize}_t / \sqrt{\text{NumNonRel}_t}$
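
A small sketch of applying the per-topic estimate above and scaling it to a fixed overall budget; the scaling step and the function name are assumptions about how the estimates would be used:

```python
import math

def per_topic_budget(pool_features, total_budget):
    """pool_features: {topic: (pool_size, num_nonrel_in_top10_pool)}."""
    raw = {t: size / math.sqrt(max(nonrel, 1))   # Estimate_t = PoolSize_t / sqrt(NumNonRel_t)
           for t, (size, nonrel) in pool_features.items()}
    scale = total_budget / sum(raw.values())     # scale raw estimates to the overall budget
    return {t: round(est * scale) for t, est in raw.items()}
```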

SLIDE 10

Collection Quality

  • 2017 Common Core collection less reusable than hoped (just too few judgments)
  • Additional experiments demonstrate greedy bandit methods can be UNFAIR

[Figure: largest observed drop in MAP ranking (2–20) plotted against the number of unique relevant retrieved (100–800) per team]

              MAP             Precision(10)
              τ       Drop    τ       Drop
  MaxMean     .980    2       .937    11
  Inferred    .961    7       .999    1

LOU results for the TREC 2017 Common Core collection.

Fairness test: build a collection from judgments on a small inferred sample, or on an equal number of documents selected by the MaxMean bandit approach (average of 300 judgments per topic). Evaluate runs using the respective judgment sets and compare run rankings to the full-collection rankings. The judgment budget is small enough that R exceeds the budget for some topics. Example: topic 389 has R=324, 45% of which are uniques; one run has 98 relevant in its top 100 ranks, so 1/3 of the relevant in the bandit set came from this single run to the exclusion of other runs.

SLIDE 11

An Aside

  • Note that this is a concrete example of why the goal in building a collection is NOT to maximize the number of relevant found!
  • The goal is actually to find an unbiased set of relevant.
  • We don’t know how to build a guaranteed unbiased judgment set, nor prove that an existing set is unbiased, but sometimes less is more.


Image: Eunice De Faria/Pixabay

SLIDE 12

Bandit Conclusions

Can be unfair when the budget is small relative to the (unknown) number of relevant

  • must reserve some of the budget for quality control, so the operative number of judgments is less than B
  • Does not provide a practical means for coordination among assessors
  • multiple human judges working at different rates and at different times
  • subject to a common overall budget
  • stopping criterion depends on the outcome of the process

Image: Pascal/flickr

SLIDE 13

HiCAL

  • TREC 2019 and 2020 Deep Learning tracks used a modification of U. of Waterloo’s HiCAL system
  • HiCAL is a dynamic method that builds a model of relevance based on available judgments and suggests the most-likely-to-be-relevant unjudged document as the next to judge
  • Modified version used in the tracks: start with depth-10 pools
  • Judge initial pools and
  • 2019: a 300-document sample selected by StatMap; estimate R
  • 2020: 100 additional documents selected by HiCAL
  • Iterate until 2·estR + 100 < |J| or estR ≈ |J|

  • Mustafa Abualsaud, Nimesh Ghelani, Haotian Zhang, Mark Smucker, Gordon Cormack, and Maura Grossman. A System for Efficient High-Recall Retrieval. SIGIR 2018. (https://hical.github.io/)

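A tiny sketch of the stopping test above, where est_r is the estimated number of relevant (estR) and num_judged is |J|; the tolerance used for “roughly equal” is an assumption:

```python
# Stop once |J| exceeds 2*estR + 100, or the estimated number of relevant is
# roughly equal to the number of judgments made so far.
def should_stop(num_judged, est_r, tol=0.05):
    roughly_equal = abs(est_r - num_judged) <= tol * max(num_judged, 1)
    return (2 * est_r + 100 < num_judged) or roughly_equal
```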

SLIDE 14

HiCAL Collection Quality?

  • Hard to say in the absence of Truth
  • Concept of uniquely retrieved relevant docs not defined, so no LOU testing
  • can leave out a team from the entire process, but HiCAL is able to recover those docs
  • of 5760 tests for the cross product of {team} × {map, P_10} × {trec8, robust, deep} × {stopping criterion}, exactly one τ was less than 0.92
  • Very few topics enter a second iteration
  • So: Deep Learning track collections are fair, probably (?) reusable, but with an unknown effect of the topic sample


Image: mohamed Hassan/Pixabay

SLIDE 15

TREC-COVID

TREC-COVID: build a pandemic test collection for current and future biomedical crises… …in a very short time frame using open-source literature on COVID-19

Images: Alexandra Koch/Pixabay

SLIDE 16

TREC-COVID

  • Structured as a series of rounds, where each round uses a superset of the previous rounds’ document and question sets.
  • The document set is CORD-19, maintained by AI2. Questions came from the search logs of medical libraries. Judgments were made by people with biomedical expertise.

SLIDE 17

TREC-COVID Rounds

               Round 1          Round 2          Round 3          Round 4          Round 5
  Dates        Apr 15–Apr 23    May 4–May 13     May 26–Jun 3     Jun 26–Jul 6     Jul 22–Aug 3
  CORD-19      Apr 10 release   May 1 release    May 19 release   Jun 19 release   Jul 16 release
               ~47k articles    ~60k articles    ~128k articles   ~158k articles   ~191k articles
  Topics       30               35               40               45               50
  Submissions  56 teams, 143    51 teams, 136    31 teams, 79     27 teams, 72     28 teams, 126
  Judgments    ~8.5k            ~20k cumulative  ~33k cumulative  ~46k cumulative  ~69k cumulative

SLIDE 18


Residual Collection Eval

  • Using residual collection evaluation: all previously judged documents for a given topic are removed from the collection for scoring
  • prevents gaming of scoring metrics (including inadvertent correlations)
  • can depress absolute values of scores, but relative scores okay
  • bookkeeping of previously judged documents important
  • each round’s submissions scored only on that round’s judgments, not cumulative qrels, so less stable

[Diagram: cumulative Rounds 1…x qrels and the Round x+1 qrels; a Round x+1 run is scored over the reduced (residual) document list using the reduced qrels to produce the reported Round x+1 score]
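
A minimal sketch of the bookkeeping step, assuming a run is a ranked list of docnos and qrels map docno to relevance; the function name is illustrative:

```python
# Documents judged in earlier rounds are removed from both the run and this
# round's qrels before scoring (residual collection evaluation).
def residualize(run, round_qrels, previously_judged):
    reduced_run = [d for d in run if d not in previously_judged]
    reduced_qrels = {d: rel for d, rel in round_qrels.items()
                     if d not in previously_judged}
    return reduced_run, reduced_qrels
```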

SLIDE 19

Judgment Sets

  • Per-round judgment period initially short
  • two “half-rounds” of judging per round of TREC-COVID
  • less than 10 days per half-round
  • estimated assessors could judge about 100 documents per topic per half-round
  • SHALLOW pools over large, diverse run sets
  • Extended judgment period in later rounds
  • fewer runs, but still diverse and largely effective (feedback)
  • individual rounds’ pools, and thus qrels, still just shallow pools because of residual collection evaluation
  • TREC-COVID Complete
  • approximately 69k judgments built from multiple rounds of feedback runs
  • 50 topics: topics added in later rounds received relatively more judgments in subsequent rounds to even out assessment effort


SLIDE 20

[Figure: per-topic bar chart (x-axis: Topic 1–50; y-axis: Number of Documents, 0–1500) with series Judged, Partially Relevant, and Fully Relevant; see caption below]

Percentage Relevant as a Quality Indicator

  • For some topics, almost ¾ of judged documents are relevant!
  • Yet stability tests suggest that the collection including those topics is fine.
  • Why?
  • about 1% of the document collection judged for some topics (an enormous percentage)
  • most runs quite effective: pools not filled with chaff & lots of the relevance space explored
  • duplicates in the document collection
  • “relevant” included partially relevant


Total number of documents judged and number of documents judged partially or fully relevant in TREC-COVID Complete collection.

SLIDE 21

Collection Stability

[Figure: four panels plotting percentage of swaps against topic set size (10–50) for MAP bins 2–5 (run score differences of 0.02–0.03, 0.03–0.04, 0.04–0.05, and 0.05–0.06), each comparing Collection 38, Collection 43, Collection 47, and Collection 50]
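
A hedged sketch of one way to compute such a swap-rate statistic (the exact protocol behind the plots is an assumption): draw two disjoint random topic subsets, bin run pairs by their score difference on the first subset, and count how often their ordering reverses on the second.

```python
import random
from itertools import combinations

def swap_rate(per_topic_scores, subset_size, bin_lo, bin_hi, trials=1000):
    """per_topic_scores: {run_id: {topic: score}}; needs >= 2*subset_size topics."""
    topics = list(next(iter(per_topic_scores.values())))

    def mean(run, ts):
        return sum(per_topic_scores[run][t] for t in ts) / len(ts)

    swaps = total = 0
    for _ in range(trials):
        sample = random.sample(topics, 2 * subset_size)   # two disjoint topic subsets
        a, b = sample[:subset_size], sample[subset_size:]
        for r1, r2 in combinations(per_topic_scores, 2):
            diff_a = mean(r1, a) - mean(r2, a)
            if bin_lo <= abs(diff_a) < bin_hi:             # pair falls in this score bin
                total += 1
                if diff_a * (mean(r1, b) - mean(r2, b)) < 0:   # ordering reversed
                    swaps += 1
    return swaps / total if total else 0.0
```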

SLIDE 22

Reusability

[Figure: Round 1 runs ordered by P@5 and by MAP, each run scored under the original and the new judgment sets; individual run names omitted]


Round 1 submissions evaluated using just the Round 1 judgments (“original”) and again using the cumulative judgments through Round 2 (“new”). Systems are ranked by decreasing score using the original judgments (so the order differs between graphs). P@5: Kendall τ = 0.9452; maximum drop = 33. MAP: Kendall τ = 0.9505; maximum drop = 19.


SLIDE 23

Building Reusable Collections

  • Shared test collections continue to be vital infrastructure for IR research
  • We lack effective ways of assessing the quality of a test collection
  • some tests that can detect some problems
  • incremental, diagnostic tests would help in collection creation
  • simulations need to address pragmatic constraints
  • perceived fairness for participants
  • initial learning period of human assessor per topic
  • budget allocation across topics (i.e., stopping conditions)
  • Current state-of-the-practice
  • diverse sets of high-quality runs are really helpful in building good collections
  • quality heuristics more context dependent than previously realized
  • we know how to build collections for topics with small R (‘small’ relative to B)
  • …but can’t know size of R until appreciable assessing has been done
