Building Reusable Test Collections
Ellen M. Voorhees
Test Collections
Evaluate search effectiveness using test collections:
1) set of documents
2) set of questions (topics)
3) relevance judgments
Judgments must cover all topics, and topics need enough relevant documents to be interesting.
The standard effectiveness measures are ratios of counts:
Precision = Number Relevant Retrieved / Number Retrieved
Recall = Number Relevant Retrieved / R   (R = total number of relevant documents for the topic)
Desiderata:
1) general-purpose: supports a wide range of measures and search scenarios
2) reusable: unbiased for systems that were not used to build the collection
3) cost: proportional to the number of human judgments required for the entire procedure
Pooling: judge only the union of the top-ranked documents from a set of runs. Given enough diverse engines, depth-l pools produce “essentially complete” judgments.
Unjudged documents are treated as not relevant when computing traditional evaluation measures such as average precision (AP).
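A minimal sketch of that convention (an illustration assuming binary judgments, not NIST's trec_eval):

    def average_precision(ranked_docs, qrels, R):
        """ranked_docs: docids in rank order; qrels: docid -> 0/1 judgment;
        R: number of known relevant documents for the topic."""
        hits, ap = 0, 0.0
        for rank, doc in enumerate(ranked_docs, start=1):
            if qrels.get(doc, 0) > 0:  # unjudged docs default to not relevant
                hits += 1
                ap += hits / rank
        return ap / R if R else 0.0

    # e.g., average_precision(["d3", "d9", "d1"], {"d3": 1, "d1": 0}, R=2) == 0.5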
A collection needs to be both fair and reusable:
1) fair: no bias against systems used to construct the collection
2) reusable: fair to systems not used in collection construction
[Diagram: building pools. For each topic (401, 402, 403, ...), the top l documents from each run (RUN A, RUN B, ...) are merged, and the union is shown to assessors as alphabetized docnos.]
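A sketch of that pool construction (function and variable names are mine, not from the talk):

    from collections import defaultdict

    def build_pools(runs, l=100):
        """runs: list of dicts mapping topic -> docnos in rank order.
        Returns topic -> alphabetized list of pooled docnos."""
        pools = defaultdict(set)
        for run in runs:
            for topic, ranking in run.items():
                pools[topic].update(ranking[:l])  # top-l from each run
        # alphabetize so assessors cannot infer rank or contributing run
        return {topic: sorted(docs) for topic, docs in pools.items()}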
Why pooling works:
1) intentional bias toward top ranks, where relevant documents are found
2) l was originally large enough to reach past the swell of topic-word relevant documents
Pooling degrades when the cut-off stays within the swell, as happens when pool size is limited due to practical constraints.
Responses to incomplete judgments:
1) sample runs differently to build unbiased pools
2) new evaluation metrics that do not assume complete judgments
C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Information Retrieval, 10(6):491–508, 2007.
“Leave Out Uniques” (LOU) test of reusability: examine the effect on the test collection if some participating team had not participated.
Procedure: remove from the qrels the relevant documents contributed uniquely by one team's runs (uniques are defined by ranks within the pool depth), re-score all runs against the reduced qrels, and compare the resulting system ranking to the original one.
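A sketch of the qrels-reduction step (the data layout is assumed; “contributors” records which teams' runs put each document into the pool):

    def lou_qrels(relevant, contributors, left_out_team):
        """relevant: set of (topic, docno) pairs judged relevant;
        contributors: (topic, docno) -> set of contributing teams.
        Keeps a document only if some other team also contributed it."""
        return {td for td in relevant
                if contributors[td] - {left_out_team}}

    # Re-score every run against the reduced qrels and compare the system
    # ranking (e.g., by Kendall's tau) to the full-qrels ranking.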
Sampling methods: define inclusion probabilities for documents to be selected to be judged, then estimate measures such as AP by inferring which unjudged documents are likely to be relevant. The estimates are unbiased (or nearly so) depending on the exact sampling strategy.
E. Yilmaz, E. Kanoulas, and J. Aslam. A simple and efficient sampling method for estimating AP and NDCG. SIGIR 2008, pp. 603–610.
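A sketch of the sampling idea, in the spirit of (but much simpler than) the cited estimators: judge a stratified random sample with known inclusion probabilities, then inverse-probability-weight the relevant documents found:

    import random

    def sample_for_judging(strata, budget):
        """strata: {stratum label (e.g., rank band): list of docnos}.
        Returns (docs to judge, docno -> inclusion probability)."""
        per_stratum = budget // len(strata)
        sampled, prob = [], {}
        for docs in strata.values():
            k = min(per_stratum, len(docs))
            sampled.extend(random.sample(docs, k))
            for d in docs:
                prob[d] = k / len(docs)
        return sampled, prob

    def estimate_R(judged_relevant, prob):
        # each sampled relevant doc stands in for 1/prob unsampled ones
        return sum(1.0 / prob[d] for d in judged_relevant)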
Multi-armed bandit methods balance exploiting known good “arms” and exploring to find better arms. For collection building, each run is an arm, and the reward is finding a relevant document. Bandit-built judgment sets rank systems about as well as full pooling but with many fewer judgments.
The TREC Common Core tracks attempted to build new collections using bandit techniques. Bandit selection method: 2017: MaxMean; 2018: MTF (move-to-front).
D. Losada, J. Parapar, and A. Barreiro. Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation. ACM SAC 2016.
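A greedy bandit sketch in the spirit of MaxMean (the exact scoring in Losada et al. differs; judge(), the run lists, and the Beta(1,1) prior are assumptions here):

    def bandit_judge(runs, judge, budget):
        """runs: list of ranked docno lists, one per run (arm);
        judge(docno) -> 0/1. Returns the accumulated qrels."""
        n = len(runs)
        rel, tot = [1] * n, [2] * n   # Beta(1,1) prior per arm
        pos = [0] * n                 # next unjudged rank in each run
        qrels = {}
        while len(qrels) < budget:
            # greedy: pull the arm with the highest posterior mean reward
            for arm in sorted(range(n), key=lambda a: rel[a] / tot[a],
                              reverse=True):
                while pos[arm] < len(runs[arm]) and runs[arm][pos[arm]] in qrels:
                    pos[arm] += 1     # skip docs already judged via another run
                if pos[arm] < len(runs[arm]):
                    break
            else:
                break                 # every run exhausted
            doc = runs[arm][pos[arm]]
            qrels[doc] = judge(doc)   # reward = relevance of the judged doc
            rel[arm] += qrels[doc]
            tot[arm] += 1
        return qrels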
Budget Allocation Strategy
How should the overall budget be divided among topics? What per-topic minimum number of judgments is needed? How does the assessor learn the topic?
Approach: fit the allocation of budget across topics for historical collections using per-topic features (PoolSize, NumRel, NumNonRel), each raised to an exponent and combined to give the budget for all topics.
Seeding the bandit method with shallow pools causes minimal degradation over the “pure” bandit method.
[Plot: total judgment budget (5,000–35,000) as a function of pool depth.]
Per-topic budget estimate:
Estimate_t = PoolSize_t / sqrt(NumNonRel_t)
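A sketch of turning that per-topic estimate into an allocation (the proportional scaling and the per-topic floor are my assumptions, not values from the talk):

    import math

    def allocate_budget(pool_size, num_nonrel, total_budget, floor=50):
        """pool_size, num_nonrel: dicts keyed by topic id."""
        est = {t: pool_size[t] / math.sqrt(max(num_nonrel[t], 1))
               for t in pool_size}
        scale = total_budget / sum(est.values())
        return {t: max(floor, round(e * scale)) for t, e in est.items()}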
The 2017 Common Core collection was less reusable than hoped (just too few judgments). Additional experiments demonstrate that greedy bandit methods can be UNFAIR.
[Plot: LOU results for the TREC 2017 Common Core collection; largest observed drop in MAP ranking (2–20) versus number of unique relevant documents retrieved (100–800), one point per team.]
Fairness test: build a collection from judgments on a small inferred sample, or on an equal number of documents selected by the MaxMean bandit approach (average …), and compare run rankings to the full-collection rankings (τ is Kendall's tau; Drop is the largest rank drop):

Method     MAP τ   Drop   P(10) τ   Drop
MaxMean    .980    2      .937      11
Inferred   .961    7      .999      1

The judgment budget is small enough that R exceeds the budget for some topics. Example: topic 389 has R=324, 45% of which are uniques; one run has 98 relevant documents in its top 100 ranks, so one third of the relevant documents in the bandit-selected set came from this single run to the exclusion of other runs.
This is why the goal in building a collection is NOT to maximize the number of relevant found! The goal is an unbiased set of relevant.
There is no way to guarantee an unbiased judgment set, nor to prove that an existing set is unbiased, but sometimes less is more.
Most Relevant Found
Practical issues with dynamic judging methods:
1) can be unfair when the budget is small relative to the (unknown) number of relevant documents
2) some judgments go to quality control, so the operative number of judgments is less than B
3) require coordination among assessors, who work at different rates and at different times, complicating the judging process
Continuous active learning (CAL): a modification of U. of Waterloo's HiCAL system. HiCAL builds a classifier that predicts relevance based on available judgments, and suggests the most-likely-to-be-relevant unjudged document as the next to judge.
Seed judging with depth-10 pools; use the judgments to estimate R. Iterate until 2·estR + 100 < |J| or estR ≈ |J|, where |J| is the number of judgments made.
Mustafa Abualsaud, Nimesh Ghelani, Haotian Zhang, Mark D. Smucker, Gordon V. Cormack, and Maura R. Grossman. A System for Efficient High-Recall Retrieval. SIGIR 2018. (https://hical.github.io/)
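A sketch of that judge-retrain loop with the stopping rule above (train/predict/estimate_R are placeholders, the 95% threshold approximating estR ≈ |J| is my assumption, and none of this is HiCAL's actual code):

    def cal_loop(corpus, judge, train, predict, estimate_R, seed_judgments):
        judged = dict(seed_judgments)         # seeded from depth-10 pools
        while True:
            model = train(judged)             # classifier over judged docs
            unjudged = [d for d in corpus if d not in judged]
            if not unjudged:
                break
            best = max(unjudged, key=lambda d: predict(model, d))
            judged[best] = judge(best)        # most-likely-relevant doc next
            est_R = estimate_R(judged)
            # stop once effort passes 2*estR + 100 judgments, or nearly
            # everything judged is relevant (estR ~ |J|)
            if 2 * est_R + 100 < len(judged) or est_R >= 0.95 * len(judged):
                break
        return judged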
Reusability: with CAL there are no pooled runs, so uniquely retrieved docs are not defined and standard LOU testing does not apply; but HiCAL was able to recover those docs.
Experiments crossed {team} × {MAP, P@10} × {TREC-8, Robust, Deep} × {stopping criterion}; exactly one τ was less than 0.92.
Verdict: fair, probably (?) reusable, but unknown effect of the topic sample.
TREC-COVID built a collection over the CORD-19 document set maintained by AI2; each round uses a superset of the previous rounds' document and question sets. Questions came from search logs of medical libraries. Judgments were made by people with biomedical expertise.
Each round consists of a new CORD-19 snapshot, run submissions, and relevance judgments:
Round 1: CORD-19, ~47k articles
Round 2: CORD-19, ~60k articles
Round 3: CORD-19, ~128k articles
Round 4: CORD-19, ~158k articles
Round 5: CORD-19, ~191k articles
[Example qrels fragment: (topic, docno, judgment) triples.]
Residual collection evaluation: previously judged documents for a given topic are removed from the collection before scoring (a scoring sketch follows the diagram below).
1) removes training effects (including inadvertent correlations)
2) absolute scores are not meaningful, but relative scores are okay
3) the ranking of the remaining documents is what matters
4) a round is scored using only that round's judgments, not the cumulative qrels, so scores are less stable
[Diagram: scoring in round x+1. The cumulative rounds 1…x qrels determine which documents are removed from run A's ranking; the reduced list is scored against the round x+1 qrels, and that score is the reported score.]
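A sketch of residual-collection scoring as diagrammed above (metric is any standard per-topic measure; the names are mine):

    def residual_score(run, previously_judged, round_qrels, metric):
        """run: docnos in rank order for one topic;
        previously_judged: docnos judged in rounds 1..x for this topic;
        round_qrels: round x+1 judgments only."""
        reduced = [d for d in run if d not in previously_judged]
        return metric(reduced, round_qrels)   # the reported score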
Residual collection evaluation also allowed new topics to be added in subsequent rounds to even out assessment effort.
[Plot: per topic, the total number of documents judged (up to ~1,500) and the number of documents judged partially or fully relevant in the TREC-COVID Complete collection.]
For some topics an enormous percentage of the judged documents are relevant! A collection including those topics is still fine provided the judgments also include chaff and lots of the relevance space was explored.
[Plots: percentage of swaps (0.0–0.4) versus topic set size (10–50) for collections of 38, 43, 47, and 50 topics, one panel per MAP bin:
bin 2: run score differences between 0.02 and 0.03
bin 3: run score differences between 0.03 and 0.04
bin 4: run score differences between 0.04 and 0.05
bin 5: run score differences between 0.05 and 0.06]
[Plot: P@5 (0.2–0.8) of all Round 1 runs, from xj4wang_run1 down to tm_lda40022, under original and new judgments.]
Round 1 submissions evaluated using just Round 1 judgments (“original”) and again using cumulative judgments through Round 2 (“new”). Systems are ranked by decreasing score using original judgments (so order is different in different graphs).
P@5: Kendall τ = 0.9452; max rank change Δ = 33
MAP: Kendall τ = 0.9505; max rank change Δ = 19
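A sketch of the ranking comparison behind these τ values (scipy.stats.kendalltau does the same job; this version assumes no tied pairs):

    from itertools import combinations

    def kendall_tau(scores_a, scores_b):
        """scores_a, scores_b: dicts mapping system name -> score."""
        systems = list(scores_a)
        concordant = discordant = 0
        for s, t in combinations(systems, 2):
            d = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
            if d > 0:
                concordant += 1
            elif d < 0:
                discordant += 1
        pairs = len(systems) * (len(systems) - 1) / 2
        return (concordant - discordant) / pairs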