Error Link Detection and Correction in Wikipedia
Chengyu Wang, Rong Zhang, Xiaofeng He, Aoying Zhou
School of Computer Science and Software Engineering East China Normal University Shanghai, China
Wikipedia   #Entities   #Links
English     3.6M        92M
Chinese     0.9M        11M
Links to Correct! The backend is written in Java…
pages.
contribution” of Wikipedia links
– Construct a dictionary $M = \{(m, E_m)\}$ containing pairs of an anchor text $m$ and its referent entity collection $E_m$
– Generate a candidate error link set $CL_m = \{\langle l_{i,j}, l_{i,j'} \rangle\}$ containing pairs of a candidate error link $l_{i,j}$ and its most likely correction $l_{i,j'}$
– Train a classifier $f$ to predict simultaneously whether $l_{i,j}$ is an error link and $l_{i,j'}$ is its corrected link
– Utilize Wikipedia to construct ambiguous anchor text-referent entity dictionary
disambiguation pages, hyperlinks, etc.
– Example
– For each anchor text
neighbors
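The dictionary construction step can be illustrated with a minimal sketch. It assumes the hyperlinks are already available as (anchor text, target entity) pairs; the function name `build_dictionary`, the lowercasing of anchor texts, and the rule of keeping only anchors with more than one referent are illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict


def build_dictionary(hyperlinks):
    """Map each ambiguous anchor text m to its referent entity set E_m.

    hyperlinks: iterable of (anchor_text, target_entity) pairs.
    Assumption: anchor texts are normalized by stripping and lowercasing.
    """
    referents = defaultdict(set)
    for anchor, target in hyperlinks:
        referents[anchor.strip().lower()].add(target)
    # Assumption: only anchors with several referents are ambiguous.
    return {m: e for m, e in referents.items() if len(e) > 1}


links = [
    ("Apple", "Apple Inc."),
    ("apple", "Apple (fruit)"),
    ("Shanghai", "Shanghai"),
]
dictionary = build_dictionary(links)
```

Here "shanghai" is dropped because it has a single referent, while "apple" keeps both of its candidate entities.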
– A PageRank-like algorithm to assign weights to links in an ATSN
– Weight transition:
$u_{i,j}^{(n)} = \frac{1}{|OutLink_j|} \cdot w_{i,j}^{(n-1)}$
– Weight update rule
$w_{i,j}^{(n)} = \sum_{l_{k,i} \in InLink_i} u_{k,i}^{(n)} + \frac{1}{|L_m|} \sum_{l_{p,q} \in L_m} w_{p,q}^{(n-1)}$
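A rough sketch of such a link-weighting iteration follows. It reads the slide as: the weight on a link entering a node is split evenly over that node's out-links, and each link's new weight is the sum of the shares flowing into its source node plus the mean link weight. The function name `link_rank`, the uniform initialization, the fixed iteration count, and the renormalization after each round are assumptions made here to keep the example stable; they are not stated on the slide.

```python
from collections import defaultdict


def link_rank(links, iterations=20):
    """Assign a weight to every directed link (i, j) of a small network."""
    links = list(dict.fromkeys(links))  # drop duplicate edges, keep order
    out_links = defaultdict(list)       # source node -> its outgoing links
    in_links = defaultdict(list)        # target node -> its incoming links
    for i, j in links:
        out_links[i].append((i, j))
        in_links[j].append((i, j))

    # Assumption: start from uniform weights.
    w = {e: 1.0 / len(links) for e in links}
    for _ in range(iterations):
        # Weight transition: the weight on link (k, i) is split over
        # the out-links of its target node i.
        u = {}
        for k, i in links:
            n_out = len(out_links[i])
            u[(k, i)] = w[(k, i)] / n_out if n_out else 0.0
        mean_w = sum(w.values()) / len(links)
        # Weight update: shares flowing into the source node, plus the mean.
        new_w = {}
        for i, j in links:
            inflow = sum(u[e] for e in in_links[i])
            new_w[(i, j)] = inflow + mean_w
        # Assumption: renormalize so the weights keep summing to 1.
        total = sum(new_w.values())
        w = {e: x / total for e, x in new_w.items()}
    return w


weights = link_rank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```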
– An asymmetric measurement based on LinkRank
– SC from $e_i$ to $e_j$: sum of weights of links between $e_i$ and all of $e_j$'s neighbors
$SC(e_i \to e_j) = \sum_{e_{j'} \in Neighbor(e_j) \wedge l_{i,j'} \in L_m} w_{i,j'}$
– $e_j$ and $e_{j'}$ share the same entity mention
– $e_i$ links to $e_j$ in Wikipedia
– Given a pre-defined threshold $\tau$, we have
$\frac{SC(e_i \to e_{j'}) - SC(e_i \to e_j)}{SC(e_i \to e_{j'})} > \tau$
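The candidate test can be sketched as follows, assuming link weights are stored in a dict keyed by (source, target) and each entity's neighbors are given as a list. The helper names, and the choice to return the alternative referent with the largest SC among those passing the threshold, are illustrative assumptions rather than the authors' exact procedure.

```python
def semantic_closeness(i, j, weights, neighbors):
    """SC(e_i -> e_j): sum of weights of links from e_i to neighbors of e_j."""
    return sum(weights[(i, jp)] for jp in neighbors.get(j, [])
               if (i, jp) in weights)


def candidate_correction(i, j, referents, weights, neighbors, tau):
    """Return an alternative referent j' if link (i, j) looks suspicious.

    referents: entities sharing the anchor text of link (i, j).
    Assumption: among alternatives whose relative SC gain exceeds tau,
    the one with the largest SC is proposed as the correction.
    """
    sc_j = semantic_closeness(i, j, weights, neighbors)
    best, best_sc = None, 0.0
    for jp in referents:
        if jp == j:
            continue
        sc_jp = semantic_closeness(i, jp, weights, neighbors)
        if sc_jp <= 0:
            continue  # avoid dividing by zero
        if (sc_jp - sc_j) / sc_jp > tau and sc_jp > best_sc:
            best, best_sc = jp, sc_jp
    return best


weights = {("p", "x"): 0.1, ("p", "y1"): 0.4, ("p", "y2"): 0.3}
neighbors = {"Apple (fruit)": ["x", "z"], "Apple Inc.": ["y1", "y2"]}
referents = {"Apple (fruit)", "Apple Inc."}
found = candidate_correction("p", "Apple (fruit)", referents,
                             weights, neighbors, tau=0.5)
strict = candidate_correction("p", "Apple (fruit)", referents,
                              weights, neighbors, tau=0.9)
```

In this toy case page "p" links mostly to neighbors of "Apple Inc.", so the link to "Apple (fruit)" becomes a candidate error at the lower threshold but not at the stricter one.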
– Inlink similarity
– $ILS(i,j) = \frac{|InLinkNode_i \cap InLinkNode_j| + 1}{|InLinkNode_i \cup InLinkNode_j| + 1}$
– Outlink similarity $OLS(i,j)$
– Inlink relatedness
– $ILR(i,j) = \frac{|\{e_k \in InLinkNode_i : l_{k,j} \in L_m\}|}{|InLinkNode_i|}$
– Outlink relatedness $OLR(i,j)$
– Context similarity $CS(i,j) = \frac{c_i^T \cdot c_j}{\|c_i\|_2 \cdot \|c_j\|_2}$
– Frequent context similarity $FCS(i,j) = \frac{fc_i^T \cdot fc_j}{\|fc_i\|_2 \cdot \|fc_j\|_2}$
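A small sketch of three of these features, under stated assumptions: in-link node sets are Python sets, links are a set of (source, target) pairs, and context vectors are sparse term-weight dicts. The function names are hypothetical, and the smoothed Jaccard form (adding 1 to both intersection and union sizes) is a reading of the slide's formula, not a confirmed definition.

```python
import math


def inlink_similarity(in_i, in_j):
    """Smoothed Jaccard overlap of two in-link node sets."""
    return (len(in_i & in_j) + 1) / (len(in_i | in_j) + 1)


def inlink_relatedness(in_i, j, links):
    """Fraction of e_i's in-link nodes that also link to e_j."""
    if not in_i:
        return 0.0
    return sum(1 for k in in_i if (k, j) in links) / len(in_i)


def cosine(c_i, c_j):
    """Cosine similarity of two sparse context vectors (term -> weight)."""
    dot = sum(v * c_j.get(t, 0.0) for t, v in c_i.items())
    norm = (math.sqrt(sum(v * v for v in c_i.values()))
            * math.sqrt(sum(v * v for v in c_j.values())))
    return dot / norm if norm else 0.0


ils = inlink_similarity({"a", "b"}, {"b", "c"})
ilr = inlink_relatedness({"a", "b"}, "j", {("a", "j")})
cs_same = cosine({"x": 1.0}, {"x": 2.0})
cs_disjoint = cosine({"x": 1.0}, {"y": 1.0})
```

The outlink variants would follow the same pattern with out-link node sets, and the frequent context similarity would reuse `cosine` on vectors restricted to frequent terms.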
– Feature vector of a link $l_{i,j}$
$v(l_{i,j}) = \langle ILS(i,j), OLS(i,j), ILR(i,j), OLR(i,j), CS(i,j), FCS(i,j) \rangle$
– Vector difference between two links: $v_c(l_{i,j}, l_{i,j'}) = v(l_{i,j}) - v(l_{i,j'})$
– Feature vector of a data instance: $v_{CL}(l_{i,j}, l_{i,j'}) = \langle v(l_{i,j}), v(l_{i,j'}), v_c(l_{i,j}, l_{i,j'}) \rangle$
– Example
– Train an SVM classifier $f$ to predict whether $l_{i,j}$ is an error link and $l_{i,j'}$ is its corrected link, based on $v_{CL}(l_{i,j}, l_{i,j'})$
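The pairwise instance vector can be assembled as in the sketch below, which simply concatenates the two link feature vectors with their element-wise difference. The function name `instance_vector` and the two-feature toy vectors are illustrative; the SVM training step itself is not shown.

```python
def instance_vector(v_link, v_corr):
    """Concatenate v(l), v(l') and their difference v(l) - v(l')."""
    if len(v_link) != len(v_corr):
        raise ValueError("feature vectors must have the same length")
    diff = [a - b for a, b in zip(v_link, v_corr)]
    return list(v_link) + list(v_corr) + diff


# Toy vectors with two features each; a real vector would hold six.
x = instance_vector([1.0, 0.5], [0.25, 0.5])
```

A vector built this way for each (candidate error link, proposed correction) pair, together with a binary label, is the kind of input a standard SVM implementation could be trained on.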
– Sample candidate error links and compare the density of error links
– Methods for comparison
ambiguous entities based on disambiguation pages
anchor texts based on the dictionary
uniform link weights
varied parameter settings
– Use SVM as the classifier to train models on candidate error link sets
– Methods for comparison (considering feature subsets)
[Figure panels: English Wikipedia; Chinese Wikipedia]
based on Vector Space Model
referent entities in Wikipedia
Wikipedia link structure
error links directly (w/o pairwise learning)
– MSNE: Multiple Senses of Named Entities
– MSC: Multiple Senses of Concepts
– ACNE: Ambiguity Between Concepts and Named Entities
– The two-stage approach is effective at detecting and correcting error links in Wikipedia.
– Most linking errors in Wikipedia are caused by multiple senses of named entities.
– Detecting error links where the correct entity is outside Wikipedia.
– Detecting and correcting errors in other Web-scale networks.
* The first author would like to thank CIKM 2016 for the SIGIR student travel grant.