SLIDE 1

Building Reusable Test Collections

Ellen M. Voorhees

SLIDE 2

Test Collections

  • Evaluate search effectiveness using test collections
  • set of documents
  • set of questions
  • relevance judgments
  • Relevance judgments
  • ideally, complete judgments: all documents judged for all topics
  • infeasible for document sets large enough to be interesting
  • so, need to sample, but how?

[Figure: example relevance judgments showing how Number Relevant Retrieved, Number Retrieved, and R (total relevant) feed into evaluation measures]

SLIDE 3

Problem Statement

Want to build general-purpose, reusable IR test collections at acceptable cost

  • General-purpose: supports a wide range of measures and search scenarios
  • Reusable: unbiased for systems that were not used to build the collection
  • Cost: proportional to the number of human judgments required for the entire procedure

SLIDE 4

Pooling

  • For sufficiently large l and diverse engines, depth-l pools produce “essentially complete” judgments
  • Unjudged documents are assumed to be not relevant when computing traditional evaluation measures such as average precision (AP)
  • Resulting test collections have been found to be both fair and reusable
  • fair: no bias against systems used to construct collection
  • reusable: fair to systems not used in collection construction

[Diagram: the top l documents from RUN A and RUN B for topics 401–403 are merged into per-topic pools of alphabetized docnos]
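
A minimal sketch of depth-l pool construction, assuming each run is a per-topic ranked list of docnos; the function name and data layout are illustrative, not from the slides:

```python
# Hedged sketch: the judging pool for one topic is the union of the top-l
# documents from every run, ordered (e.g., alphabetically) so assessors see
# no rank information.
def build_pool(runs, l):
    """runs: dict mapping run_id -> list of docnos ranked best-first."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:l])      # top-l documents from each run
    return sorted(pool)               # alphabetized docnos, as in the diagram

# Example for topic 401 with depth l = 2
runs_401 = {"RUN_A": ["doc7", "doc3", "doc9"],
            "RUN_B": ["doc3", "doc5", "doc1"]}
print(build_pool(runs_401, l=2))      # ['doc3', 'doc5', 'doc7']
```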

SLIDE 5

Pooling Bias

  • Traditional pooling takes the top l documents
  • intentional bias toward top ranks, where relevant documents are found
  • l was originally large enough to reach past the swell of topic-word relevant documents
  • As the document collection grows, a constant cut-off stays within the swell
  • Pools cannot be proportional to corpus size due to practical constraints
  • sample runs differently to build unbiased pools
  • new evaluation metrics that do not assume complete judgments


  • C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Information Retrieval, 10(6):491–508, 2007.

SLIDE 6

LOU Test

“Leave Out Uniques” test of reusability: examine the effect on the test collection if some participating team had not participated

Procedure

  • create a judgment set that removes all relevant documents uniquely retrieved by one team
  • evaluate all runs using the original judgment set and again using the newly created set
  • compare evaluation results
  • Kendall’s τ between system rankings
  • maximum drop in ranking over the runs submitted by the team
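
A hedged sketch of the LOU procedure; the scoring helper (score_runs), the qrels layout ({(topic, doc): relevance}), and team_unique_rel are assumptions for illustration, not code from the slides:

```python
# Hedged sketch of the Leave-Out-Uniques reusability test.
from scipy.stats import kendalltau

def lou_test(score_runs, runs, qrels, team_unique_rel, team_runs):
    """team_unique_rel: relevant (topic, doc) pairs retrieved only by the held-out team.
    Returns Kendall's tau between the two system rankings and the team's max rank drop."""
    reduced_qrels = {k: v for k, v in qrels.items() if k not in team_unique_rel}
    full = score_runs(runs, qrels)              # {run_id: score} on original judgments
    reduced = score_runs(runs, reduced_qrels)   # {run_id: score} on reduced judgments
    order = sorted(full, key=full.get, reverse=True)
    tau, _ = kendalltau([full[r] for r in order], [reduced[r] for r in order])
    rank_full = {r: i for i, r in enumerate(order)}
    rank_reduced = {r: i for i, r in
                    enumerate(sorted(reduced, key=reduced.get, reverse=True))}
    # positive value = the run fell that many places when the uniques were removed
    max_drop = max(rank_reduced[r] - rank_full[r] for r in team_runs)
    return tau, max_drop
```
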
SLIDE 7

Inferred Measure Sampling

  • Stratified sampling where strata are defined by ranks
  • Different strata have different probabilities for documents to be selected to be judged
  • Given strata and probabilities, estimate AP by inferring which unjudged docs are likely to be relevant
  • Quality of estimate varies widely depending on exact sampling strategy
  • Fair, but may be less reusable
  • E. Yilmaz, E. Kanoulas, and J. A. Aslam. A simple and efficient sampling method for estimating AP and NDCG. SIGIR 2008, pp. 603–610.
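
A hedged sketch of the sampling idea only (the full inferred/statistical AP estimators in the cited paper involve more machinery); strata boundaries, probabilities, and function names are illustrative assumptions:

```python
# Rank-stratified sampling plus a Horvitz-Thompson style estimate of R.
import random

def sample_for_judging(ranked_pool, strata):
    """strata: list of (depth, probability) pairs with increasing depths,
    e.g. [(20, 1.0), (100, 0.3), (1000, 0.05)].
    Returns (docno, inclusion_probability) pairs selected for judging."""
    sampled, start = [], 0
    for depth, prob in strata:
        for doc in ranked_pool[start:depth]:
            if random.random() < prob:
                sampled.append((doc, prob))
        start = depth
    return sampled

def estimate_num_relevant(sampled, judge):
    """judge(doc) -> True if relevant; weight each sampled relevant doc by 1/p."""
    return sum(1.0 / p for doc, p in sampled if judge(doc))
```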

SLIDE 8

Multi-armed Bandit Sampling

  • Bandit techniques trade off between exploiting known good “arms” and exploring to find better arms. For collection building, each run is an arm, and the reward is finding a relevant document
  • Simulations suggest bandit methods can produce similar-quality collections as pooling but with many fewer judgments
  • TREC 2017 Common Core track was the first attempt to build a new collection using a bandit technique
  • bandit selection method: 2017: MaxMean; 2018: MTF

  • D. Losada, J. Parapar, and A. Barreiro. Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation. Proceedings of SAC 2016, pp. 1027–1034.

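A simplified greedy sketch in the spirit of run-as-arm bandit judging (not the exact MaxMean or MTF algorithms from the cited paper); the smoothing, data layout, and names are assumptions:

```python
# Each run is an "arm"; pulling an arm judges that run's next unjudged document
# for one topic, and the reward is finding a relevant document.
def bandit_judging(runs, judge, budget):
    """runs: {run_id: ranked docnos for one topic}; judge(doc) -> bool; returns qrels."""
    pos = {r: 0 for r in runs}        # next unjudged rank in each run
    rel = {r: 0 for r in runs}        # relevant documents credited to each run
    pulls = {r: 0 for r in runs}      # judgments charged to each run
    qrels = {}
    for _ in range(budget):
        candidates = [r for r in runs if pos[r] < len(runs[r])]
        if not candidates:
            break
        # smoothed mean reward (Beta(1,1) prior) so unexplored runs still get pulled
        best = max(candidates, key=lambda r: (rel[r] + 1) / (pulls[r] + 2))
        doc = runs[best][pos[best]]
        pos[best] += 1
        pulls[best] += 1
        if doc not in qrels:          # judge each document at most once
            qrels[doc] = judge(doc)
        if qrels[doc]:
            rel[best] += 1
    return qrels
```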

SLIDE 9

Implementing a practical bandit approach

How should the overall budget be divided among topics?

  • use features of top-10 pools (PoolSize, NumRel, NumNonRel) to predict the per-topic minimum number of judgments needed
  • results in a conservative, but reasonable, allocation of budget across topics for historical collections

[Figure: Budget Allocation Strategy: combined budget for all topics (5,000–35,000 judgments) as a function of pool depth, for different feature and exponent choices]

How does the assessor learn the topic?

  • allocating some budget to shallow pools causes minimal degradation over the “pure” bandit method

Per-topic estimate: $\text{Estimate}_t = \text{PoolSize}_t / \sqrt{\text{NumNonRel}_t}$
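
A small sketch of applying the per-topic estimate above and scaling it to a fixed overall budget; the scaling step and the function name are assumptions about how the estimates would be used:

```python
import math

def per_topic_budget(pool_features, total_budget):
    """pool_features: {topic: (pool_size, num_nonrel_in_top10_pool)}."""
    raw = {t: size / math.sqrt(max(nonrel, 1))   # Estimate_t = PoolSize_t / sqrt(NumNonRel_t)
           for t, (size, nonrel) in pool_features.items()}
    scale = total_budget / sum(raw.values())     # scale raw estimates to the overall budget
    return {t: round(est * scale) for t, est in raw.items()}
```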

SLIDE 10

Collection Quality

  • 2017 Common Core collection less reusable than hoped (just too few judgments)
  • Additional experiments demonstrate greedy bandit methods can be UNFAIR

[Figure: largest observed drop in MAP ranking (2–20) plotted against the number of unique relevant retrieved (100–800) per team]

              MAP             Precision(10)
              τ       Drop    τ       Drop
  MaxMean     .980    2       .937    11
  Inferred    .961    7       .999    1

LOU results for the TREC 2017 Common Core collection.

Fairness test: build a collection from judgments on a small inferred sample, or on an equal number of documents selected by the MaxMean bandit approach (average of 300 judgments per topic). Evaluate runs using the respective judgment sets and compare run rankings to the full-collection rankings. The judgment budget is small enough that R exceeds the budget for some topics. Example: topic 389 has R=324, 45% of which are uniques; one run has 98 relevant in its top 100 ranks, so 1/3 of the relevant in the bandit set came from this single run to the exclusion of other runs.

SLIDE 11

An Aside

  • Note that this is a concrete example of why the goal in building a collection is NOT to maximize the number of relevant found!
  • The goal is actually to find an unbiased set of relevant.
  • We don’t know how to build a guaranteed unbiased judgment set, nor prove that an existing set is unbiased, but sometimes less is more.


Image: Eunice De Faria/Pixabay

SLIDE 12

Bandit Conclusions

Can be unfair when the budget is small relative to the (unknown) number of relevant

  • must reserve some of the budget for quality control, so the operative number of judgments is less than B
  • Does not provide a practical means for coordination among assessors
  • multiple human judges working at different rates and at different times
  • subject to a common overall budget
  • stopping criterion depends on the outcome of the process

Image: Pascal/flickr

SLIDE 13

HiCAL

  • TREC 2019 and 2020 Deep Learning tracks used a modification of U. of Waterloo’s HiCAL system
  • HiCAL is a dynamic method that builds a model of relevance based on available judgments and suggests the most-likely-to-be-relevant unjudged document as the next to judge
  • Modified version used in the tracks: start with depth-10 pools
  • Judge initial pools and
  • 2019: a 300-document sample selected by StatMap; estimate R
  • 2020: 100 additional documents selected by HiCAL
  • Iterate until 2·estR + 100 < |J| or estR ≈ |J|

  • Mustafa Abualsaud, Nimesh Ghelani, Haotian Zhang, Mark Smucker, Gordon Cormack, and Maura Grossman. A System for Efficient High-Recall Retrieval. SIGIR 2018. (https://hical.github.io/)

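A tiny sketch of the stopping test above, where est_r is the estimated number of relevant (estR) and num_judged is |J|; the tolerance used for “roughly equal” is an assumption:

```python
# Stop once |J| exceeds 2*estR + 100, or the estimated number of relevant is
# roughly equal to the number of judgments made so far.
def should_stop(num_judged, est_r, tol=0.05):
    roughly_equal = abs(est_r - num_judged) <= tol * max(num_judged, 1)
    return (2 * est_r + 100 < num_judged) or roughly_equal
```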

SLIDE 14

HiCAL Collection Quality?

  • Hard to say in the absence of Truth
  • Concept of uniquely retrieved relevant docs not defined, so no LOU testing
  • can leave out a team from the entire process, but HiCAL is able to recover those docs
  • of 5760 tests for the cross product of {team} × {map, P_10} × {trec8, robust, deep} × {stopping criterion}, exactly one τ was less than 0.92
  • Very few topics enter a second iteration
  • So: Deep Learning track collections are fair, probably (?) reusable, but with an unknown effect of the topic sample


Image: mohamed Hassan/Pixabay

SLIDE 15

TREC-COVID

TREC-COVID: build a pandemic test collection for current and future biomedical crises… …in a very short time frame using open-source literature on COVID-19

Images: Alexandra Koch/Pixabay

SLIDE 16

TREC-COVID

  • Structured as a series of rounds, where each round uses a superset of the previous rounds’ document and question sets.
  • The document set is CORD-19, maintained by AI2. Questions came from the search logs of medical libraries. Judgments were made by people with biomedical expertise.

SLIDE 17

TREC-COVID Rounds

               Round 1          Round 2          Round 3          Round 4          Round 5
  Dates        Apr 15–Apr 23    May 4–May 13     May 26–Jun 3     Jun 26–Jul 6     Jul 22–Aug 3
  CORD-19      Apr 10 release   May 1 release    May 19 release   Jun 19 release   Jul 16 release
               ~47k articles    ~60k articles    ~128k articles   ~158k articles   ~191k articles
  Topics       30               35               40               45               50
  Submissions  56 teams, 143    51 teams, 136    31 teams, 79     27 teams, 72     28 teams, 126
  Judgments    ~8.5k            ~20k cumulative  ~33k cumulative  ~46k cumulative  ~69k cumulative

SLIDE 18


Residual Collection Eval

  • Using residual collection evaluation: all previously judged documents for a given topic are removed from the collection for scoring
  • prevents gaming of scoring metrics (including inadvertent correlations)
  • can depress absolute values of scores, but relative scores okay
  • bookkeeping of previously judged documents important
  • each round’s submissions scored only on that round’s judgments, not cumulative qrels, so less stable

[Diagram: cumulative Rounds 1…x qrels and the Round x+1 qrels; a Round x+1 run is scored over the reduced (residual) document list using the reduced qrels to produce the reported Round x+1 score]
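
A minimal sketch of the bookkeeping step, assuming a run is a ranked list of docnos and qrels map docno to relevance; the function name is illustrative:

```python
# Documents judged in earlier rounds are removed from both the run and this
# round's qrels before scoring (residual collection evaluation).
def residualize(run, round_qrels, previously_judged):
    reduced_run = [d for d in run if d not in previously_judged]
    reduced_qrels = {d: rel for d, rel in round_qrels.items()
                     if d not in previously_judged}
    return reduced_run, reduced_qrels
```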

SLIDE 19

Judgment Sets

  • Per-round judgment period initially short
  • two “half-rounds” of judging per round of TREC-COVID
  • less than 10 days per half-round
  • estimated assessors could judge about 100 documents per topic per half-round
  • SHALLOW pools over large, diverse run sets
  • Extended judgment period in later rounds
  • fewer runs, but still diverse and largely effective (feedback)
  • individual rounds’ pools, and thus qrels, still just shallow pools because of residual collection evaluation
  • TREC-COVID Complete
  • approximately 69k judgments built from multiple rounds of feedback runs
  • 50 topics: topics added in later rounds received relatively more judgments in subsequent rounds to even out assessment effort


SLIDE 20

[Figure: per-topic bar chart (x-axis: Topic 1–50; y-axis: Number of Documents, 0–1500) with series Judged, Partially Relevant, and Fully Relevant; see caption below]

Percentage Relevant as a Quality Indicator

  • For some topics, almost ¾ of judged documents are relevant!
  • Yet stability tests suggest that the collection including those topics is fine.
  • Why?
  • about 1% of the document collection judged for some topics (an enormous percentage)
  • most runs quite effective: pools not filled with chaff & lots of the relevance space explored
  • duplicates in the document collection
  • “relevant” included partially relevant


Total number of documents judged and number of documents judged partially or fully relevant in TREC-COVID Complete collection.

SLIDE 21

Collection Stability

[Figure: four panels plotting percentage of swaps against topic set size (10–50) for MAP bins 2–5 (run score differences of 0.02–0.03, 0.03–0.04, 0.04–0.05, and 0.05–0.06), each comparing Collection 38, Collection 43, Collection 47, and Collection 50]
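
A hedged sketch of one way to compute such a swap-rate statistic (the exact protocol behind the plots is an assumption): draw two disjoint random topic subsets, bin run pairs by their score difference on the first subset, and count how often their ordering reverses on the second.

```python
import random
from itertools import combinations

def swap_rate(per_topic_scores, subset_size, bin_lo, bin_hi, trials=1000):
    """per_topic_scores: {run_id: {topic: score}}; needs >= 2*subset_size topics."""
    topics = list(next(iter(per_topic_scores.values())))

    def mean(run, ts):
        return sum(per_topic_scores[run][t] for t in ts) / len(ts)

    swaps = total = 0
    for _ in range(trials):
        sample = random.sample(topics, 2 * subset_size)   # two disjoint topic subsets
        a, b = sample[:subset_size], sample[subset_size:]
        for r1, r2 in combinations(per_topic_scores, 2):
            diff_a = mean(r1, a) - mean(r2, a)
            if bin_lo <= abs(diff_a) < bin_hi:             # pair falls in this score bin
                total += 1
                if diff_a * (mean(r1, b) - mean(r2, b)) < 0:   # ordering reversed
                    swaps += 1
    return swaps / total if total else 0.0
```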

SLIDE 22

Reusability

[Figure: Round 1 runs ordered by P@5 and by MAP, each run scored under the original and the new judgment sets; individual run names omitted]


Round 1 submissions evaluated using just the Round 1 judgments (“original”) and again using the cumulative judgments through Round 2 (“new”). Systems are ranked by decreasing score using the original judgments (so the order differs between graphs). P@5: Kendall τ = 0.9452; maximum drop = 33. MAP: Kendall τ = 0.9505; maximum drop = 19.


SLIDE 23

Building Reusable Collections

  • Shared test collections continue to be vital infrastructure for IR research
  • We lack effective ways of assessing the quality of a test collection
  • some tests that can detect some problems
  • incremental, diagnostic tests would help in collection creation
  • simulations need to address pragmatic constraints
  • perceived fairness for participants
  • initial learning period of human assessor per topic
  • budget allocation across topics (i.e., stopping conditions)
  • Current state-of-the-practice
  • diverse sets of high-quality runs are really helpful in building good collections
  • quality heuristics more context dependent than previously realized
  • we know how to build collections for topics with small R (‘small’ relative to B)
  • …but can’t know size of R until appreciable assessing has been done
