Fact Harvesting from Natural Language Text in Wikipedia
Matteo Cannaviccio (Roma Tre University) Denilson Barbosa (University of Alberta) Paolo Merialdo (Roma Tre University)
July 6, 2016 – AT&T
Knowledge Graphs
Enabling technology for:
semantic search in terms of entities and relations (not keywords and pages)
text analytics
text understanding / summarization
recommendation systems that identify personalized entities and relations
Examples: Knowledge Vault, Microsoft Probase
Entities, entity types
An entity represents a real-world object.
Example entity types: all people → person → film director; location → state
Relations and facts
A relation describes a semantic association between two entity types, e.g., birthPlace: person → location.
A fact is an association between two specific entities, i.e., an instance of a relation (e.g., a birthPlace edge between a particular person and a particular location).
Entities (nodes) and facts (edges)
Example relations: birthPlace, director, spouse
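As a small aside (not from the talk), a knowledge graph can be seen as a set of typed triples; the entity and relation names below are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An edge of the knowledge graph: relation(subject, object)."""
    subject: str   # entity, e.g. "Michelle_Obama" (type: person)
    relation: str  # e.g. "birthPlace" (person -> location)
    object: str    # entity, e.g. "Chicago" (type: location)

# A tiny knowledge graph is just a set of such edges between entity nodes.
kg = {
    Fact("Michelle_Obama", "birthPlace", "Chicago"),
    Fact("Michelle_Obama", "spouse", "Barack_Obama"),
}
print(len(kg))  # 2 facts over 3 entities
```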
[Figure: sizes of popular knowledge graphs (e.g., the Google Knowledge Graph, Knowledge Vault), ranging from about 250 to 350K entity types and from about 100 to 6K relations per graph [Dong16, Weikum16]]
[Chart: #Facts/Entities in Freebase (as of March 2016) [Dong16, West+14]]
Lector
Goal: extract many facts
Source: Wikipedia text (homogeneous language), beyond structured data (category, infoboxes, …)
Process: described in the following slides
[Chart: Wikipedia articles with no infobox]
Our purpose: augment a KG with facts extracted from Wikipedia text.
Experiment: facts in the domain of people.
Result: Lector can extract more than 200K facts.
Our method
We rely on the duality between facts and patterns.
Facts: (Michelle, Harvard), (Hillary, Yale)
Patterns: X studied at Y; X graduated from Y; X earned his degree from Y; X was a student at Y; X visited Y
Fact candidates: (Michelle, Harvard), (Alberto, PoliMi), (Hillary, Yale), (Wesley, UofTexas)
Adapted from an example by Gerhard Weikum
DIPRE (1998)
Snowball (2000), Espresso (2006), NELL (2010), …
TextRunner (2007), ReVerb (2011), OLLIE (2012), …
Facts: (Michelle, Harvard), (Hillary, Yale)
Patterns: X studied at Y; X graduated from Y; X earned his degree from Y; X was a student at Y; X visited Y
Fact candidates (noisy, drifting): (Michelle, Harvard), (Hillary, Yale), (Divesh, RomaTre), (Alberto, PoliMi), (Wesley, UofTexas)
Adapted from an example by Gerhard Weikum
(Many) facts from the KG: (Michelle, Harvard), (Hillary, Yale), …
(Good) phrases from articles: X studied at Y; X graduated from Y; X earned his degree from Y; …
New facts: (Michelle, Harvard), (Hillary, Yale), (Alberto, PoliMi), …
Adapted from an example by Gerhard Weikum
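A minimal sketch of this one-pass idea, assuming hypothetical entity pairs and phrases: known KG facts label the phrases that connect them, and those phrases then propose new facts for unseen pairs. This is an illustration of the approach, not the Lector implementation.

```python
from collections import Counter, defaultdict

# Seed facts from the KG: relation -> set of (subject, object) pairs
seed_facts = {
    "almaMater": {("Michelle", "Harvard"), ("Hillary", "Yale")},
}

# (entity1, phrase, entity2) occurrences harvested from article sentences
occurrences = [
    ("Michelle", "studied at", "Harvard"),
    ("Hillary", "graduated from", "Yale"),
    ("Alberto", "studied at", "PoliMi"),
    ("Wesley", "visited", "UofTexas"),
]

# 1) Count how often each phrase connects a pair already known for the relation
phrase_counts = defaultdict(Counter)
for e1, phrase, e2 in occurrences:
    for relation, pairs in seed_facts.items():
        if (e1, e2) in pairs:
            phrase_counts[relation][phrase] += 1

# 2) Propose new facts: unseen pairs connected by a phrase that witnesses the relation
new_facts = set()
for e1, phrase, e2 in occurrences:
    for relation, counter in phrase_counts.items():
        if phrase in counter and (e1, e2) not in seed_facts[relation]:
            new_facts.add((relation, e1, e2))

print(new_facts)  # {('almaMater', 'Alberto', 'PoliMi')}
```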
[Pipeline overview: Wikipedia articles → annotated articles (entity mentions en1 … en4 around phrases such as “was born in”, “attended”, “is a graduate of”) → phrase/relation counts computed against Freebase facts (almaMater, birthPlace, …) → new facts]
We rely on Wikipedia's original entities:
the primary entity (the subject of the article)
the secondary entities (the entities linked in the article)
Primary entity
Disambiguated by the page itself … but never linked in its own article! We match the primary entity using:
the article title (Michelle Obama)
parts of the name (Obama)
the full name (Michelle LaVaughn Robinson Obama)
pronouns (She)
Secondary entities
Disambiguated by wiki-links … but only at their first occurrence! We match later mentions of secondary entities, e.g., (University of Chicago Medical Center), (University of Chicago).
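A rough sketch of the matching idea (my reading of the slides, not Lector's actual rules): the primary entity is matched through name variants built from the article title, the full name, and pronouns, while secondary entities are matched by propagating each first-occurrence wiki-link to later, unlinked mentions. All names and the regex are illustrative.

```python
import re

def primary_entity_variants(title, full_name):
    """Surface forms used to spot the primary entity, which is never wiki-linked
    in its own article: the title, its last token, the full name, and pronouns.
    (The pronoun list and the 'last token' rule are simplifying assumptions.)"""
    variants = {title, full_name, title.split()[-1]}
    variants |= {"He", "She", "he", "she"}
    return variants

def annotate(text, title, full_name, linked_entities):
    """Replace known surface forms with entity markers.
    `linked_entities` maps an anchor text (first, linked occurrence) to its entity."""
    surface_to_entity = {v: title for v in primary_entity_variants(title, full_name)}
    surface_to_entity.update(linked_entities)  # propagate wiki-links to later mentions
    # Longest surface forms first, so "Michelle Obama" wins over "Obama"
    for surface in sorted(surface_to_entity, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(surface)}\b", f"<{surface_to_entity[surface]}>", text)
    return text

print(annotate(
    "She is a graduate of Princeton University. Obama worked at the University of Chicago.",
    title="Michelle Obama",
    full_name="Michelle LaVaughn Robinson Obama",
    linked_entities={"Princeton University": "Princeton_University",
                     "University of Chicago": "University_of_Chicago"},
))
```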
[Pipeline overview figure (repeated)]
For each sentence in all the articles containing two annotated entities (en1 and en2), we take the phrase between them, then:
Filtering relational phrases (R)
“is married to” → [VB], [VB], [TO] → relational
“together with” → [RP], [IN] → not relational
Generalizing phrases (G)
“was the first”, “was the 41st” → “was the ORD”
“is an American”, “is a Canadian” → “is a NAT”
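A simplified sketch of these two steps (not Lector's actual code): the POS-based filter follows the [VB]/[TO] vs. [RP]/[IN] rule above, and the generalizer replaces ordinals and nationalities with placeholder tokens; the tag set, nationality list, and regexes are illustrative assumptions.

```python
import re

# Phrases are assumed to come already POS-tagged, e.g. from any off-the-shelf tagger.
def is_relational(tagged_phrase):
    """Keep a phrase only if it contains a verb or an infinitival 'to' (VB*, TO);
    phrases with no such tag, e.g. particle/preposition-only ones (RP, IN), are dropped.
    This mirrors the rule on the slide, simplified."""
    return any(tag.startswith("VB") or tag == "TO" for _, tag in tagged_phrase)

def generalize(phrase):
    """Replace ordinals and nationalities with placeholder tokens, as in
    'was the 41st' -> 'was the ORD'. The nationality list is a tiny example."""
    phrase = re.sub(r"\b\d+(st|nd|rd|th)\b|\bfirst\b|\bsecond\b", "ORD", phrase)
    nationalities = {"American", "Canadian", "Italian"}
    return " ".join("NAT" if tok in nationalities else tok for tok in phrase.split())

print(is_relational([("is", "VBZ"), ("married", "VBN"), ("to", "TO")]))  # True
print(is_relational([("together", "RP"), ("with", "IN")]))               # False
print(generalize("was the 41st"))    # "was the ORD"
print(generalize("is an American"))  # "is an NAT"
```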
Considering only the witness count is not reliable: e.g., “was born in” co-occurs with facts of both birthPlace and deathPlace, …
For each relation q, we rank the phrases s_j by:
score(s_j, q) = P(s_j | q) · ln c(q, s_j)
where c(q, s_j) is the witness count of phrase s_j for relation q and P(s_j | q) its conditional probability (see the example table for birthPlace at the end of the deck).
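A minimal sketch of this ranking, assuming the score is the product of the conditional probability and the log of the witness count (reconstructed from the example table; the paper's exact formula may differ). The counts below are the birthPlace examples from that table.

```python
import math

def score(prob, count):
    """score(s_j, q) = P(s_j | q) * ln c(q, s_j); reconstructed from the example
    table, so treat it as an illustration rather than the official formula."""
    return prob * math.log(count)

# (phrase, witness count c(q, s_j), P(s_j | q)) for q = birthPlace, from the example table
phrases = [
    ("was born in", 106_200, 0.73),
    ("was born at", 5_606, 0.86),
    ("is a", 4_890, 0.07),
    ("returned to", 3_459, 0.21),
]

K = 2  # keep only the top-K phrases for the relation
ranked = sorted(phrases, key=lambda p: score(p[2], p[1]), reverse=True)
for phrase, count, prob in ranked[:K]:
    print(f"{phrase}: {score(prob, count):.2f}")  # ~8.45 and ~7.42, matching the table up to rounding
```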
[Pipeline overview figure (repeated): from annotated articles and Freebase facts to new facts]
people/person/place_of_birth people/person/place_of_death people/person/nationality sports/pro_athlete/teams people/person/education people/person/spouse people/person/parents people/person/children people/person/ethnicity people/person/religion award/award_winner/awards_won government/politician/party
K: maximum number of phrases for each relation
Aim of the experiment
Freebase relation | # facts already in Freebase | extracted by Lector (not yet in FB) | evaluated facts (for estimated accuracy)
people/person/place_of_birth | 662,192 | 57,140 | 347
people/person/place_of_death | 178,849 | 18,458 | 104
people/person/nationality | 584,792 | 50,234 | 290
sports/pro_athlete/teams | 145,080 | 49,809 | 286
people/person/education | 378,043 | 46,342 | 286
people/person/spouse | 130,425 | 14,939 | 97
people/person/parents | 123,747 | 5,648 | 50
people/person/children | 141,860 | 3,149 | 50
people/person/ethnicity | 39,869 | 2,989 | 50
people/person/religion | 47,016 | 1,437 | 50
award/award_winner/awards_won | 98,625 | 1,934 | 50
government/politician/party | 65,300 | 3,684 | 50
All the numbers are computed over the 977K person entities obtained from RDF interlinks (owl:sameAs).
Ambiguous phrases: e.g., “met”
Impact of K (number of phrases per relation)
K ∈ {1, 5, 10, 15, 20}
1,800 manually evaluated facts
[Chart: results for K = 1, 5, 10, 15, 20]
Removing it: (+) accuracy: 97.24% ± 1.49%; (−) extracted facts: from 57K to 50K (−8%)
# facts extracted by Lector (not yet in FB) | DBpedia relation | not in DBpedia | YAGO relation | not in YAGO
57,140 | birthPlace | 48,314 | wasBornIn | 55,577
18,458 | deathPlace | 15,818 | diedIn | 18,014
50,234 | nationality | 48,125 | isCitizenOf | 49,977
49,809 | team | 23,640 | playsFor | 35,013
46,342 | almaMater | 45,585 | graduatedFrom | 46,095
14,939 | spouse | 14,662 | isMarriedTo | 14,573
5,648 | parent | 5,631 | – | –
3,149 | child | 3,140 | – | –
2,989 | ethnicity | 2,890 | – | –
1,437 | religion | 1,368 | – | –
1,934 | award | 1,655 | – | –
3,684 | party | 3,594 | isPoliticianOf | 3,684
Future work
filter ambiguous phrases
extend the process to other relations
All the facts produced are available for download at: http://dx.doi.org/10.7939/DVN/10795
[Charts: top-10 phrases for birthPlace and deathPlace, by raw witness count vs. by score. By raw count, noisy phrases appear (e.g., “was an”, “is a”, “returned to”, “died in” for birthPlace; “arrived in”, “went to”, “moved to”, “was born in” for deathPlace); by score, the top phrases are relation-specific (e.g., “was born in”, “was born at”, “born in”, “is a native of”, “grew up in”, “was born on” for birthPlace; “died in”, “died at”, “retired to”, “settled in”, “was assassinated in”, “was killed in”, “was executed in” for deathPlace).]
We “normalize” lists of entities using “such as” (Hearst) patterns:
<Ronaldo> played for many teams such as <FCBarcelona>, <Real_Madrid> and <InterFC>
→ <Ronaldo> played for <FCBarcelona>
→ <Ronaldo> played for <Real_Madrid>
→ <Ronaldo> played for <InterFC>
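A toy sketch of this list normalization, assuming entity mentions are already marked with angle brackets; the regex only handles the “… such as A, B and C” shape of the example above.

```python
import re

def normalize_list(sentence):
    """Turn '<X> <verb phrase> many <noun> such as <A>, <B> and <C>' into one
    candidate per listed entity, e.g. '<X> played for <A>'. Toy rule: the generic
    noun phrase right before 'such as' is dropped."""
    m = re.match(r"(<[^>]+>)\s+(.*?)\s+(?:many\s+)?\w+\s+such as\s+(.+)", sentence)
    if not m:
        return [sentence]
    subject, phrase, items = m.groups()
    return [f"{subject} {phrase} {e}" for e in re.findall(r"<[^>]+>", items)]

print(normalize_list(
    "<Ronaldo> played for many teams such as <FCBarcelona>, <Real_Madrid> and <InterFC>"
))
# ['<Ronaldo> played for <FCBarcelona>',
#  '<Ronaldo> played for <Real_Madrid>',
#  '<Ronaldo> played for <InterFC>']
```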
To improve accuracy, check around!
<Alice>, the sister of <Bob>, is married with <Charlie>
<Alice> is married with <Bob>’s brother
To improve recall, find subordinate clauses!
<Ronaldo> played for <FCBarcelona> and then moved to <InterFC>
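A toy sketch of the recall tip above: split a coordinated clause so the shared subject is paired with each object. It only handles the “… and then …” shape of the Ronaldo example; a real system would use a parser.

```python
import re

def expand_coordinations(sentence):
    """Split '<X> phrase1 <A> and (then) phrase2 <B>' into two simple clauses that
    share the subject, so both entity pairs become fact candidates."""
    m = re.match(r"(<[^>]+>)\s+(.*?<[^>]+>)\s*,?\s+and (?:then )?(.*<[^>]+>)", sentence)
    if not m:
        return [sentence]
    subject, first, second = m.groups()
    return [f"{subject} {first}", f"{subject} {second}"]

print(expand_coordinations("<Ronaldo> played for <FCBarcelona> and then moved to <InterFC>"))
# ['<Ronaldo> played for <FCBarcelona>', '<Ronaldo> moved to <InterFC>']
```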
phrase | c(q, s_j) | P(s_j | q) | score(s_j, q)
was born in | 106,200 | 0.73 | 8.43
was born at | 5,606 | 0.86 | 7.43
is a native of | 445 | 0.72 | 4.43
grew up in | 2,399 | 0.54 | 4.22
was born on | 165 | 0.81 | 4.15
born in | 431 | 0.68 | 4.13
was a native of | 440 | 0.65 | 4.00
is originally from | 163 | 0.70 | 3.60
hails from | 149 | 0.68 | 3.40
is a | 4,890 | 0.07 | 0.61
returned to | 3,459 | 0.21 | 1.72
died in | 2,108 | 0.11 | 0.79
was raised in | 615 | 0.46 | 3.01
Only the top-k phrases per relation are kept; the low-scoring ones are filtered.