Fact Harvesting from Natural Language Text in Wikipedia — PowerPoint PPT Presentation


SLIDE 1

Fact Harvesting from Natural Language Text in Wikipedia

Matteo Cannaviccio (Roma Tre University) Denilson Barbosa (University of Alberta) Paolo Merialdo (Roma Tre University)

July 6, 2016 – AT&T

SLIDE 2

Knowledge Graphs

Enabling technology for:
  • semantic search in terms of entities and relations (not keywords and pages)
  • text analytics
  • text understanding and summarization
  • recommendation systems that identify personalized entities and relations

SLIDE 3

Knowledge Graphs: Semantic Search

SLIDE 4

Knowledge Graphs: Semantic Search

SLIDE 5

Knowledge Graphs: Semantic Search

SLIDE 6

Knowledge Graphs: Semantic Search

SLIDE 7

Knowledge Graphs: Recommendation Systems

SLIDE 8

Knowledge Graphs

[Examples shown as logos: Knowledge Vault, Microsoft Probase, …]

SLIDE 9

What is a Knowledge Graph (1)

A graph that aims to describe knowledge about the real world

Entities, entity types

  • An entity is an instance (with an id) of multiple types; it represents a real-world object
  • Entity types are organized in a hierarchy

[Example type hierarchy with nodes: all, people, person, film director, location, state]

SLIDE 10

What is a Knowledge Graph (2)

A graph that aims to describe knowledge about the real world

Relations and facts

  • A relation is a triple: subject type – predicate – object type; it describes a semantic association between two entity types

[Example relation: person –birthPlace→ location]

SLIDE 11

What is a Knowledge Graph (3)

A graph that aims to describe knowledge about the real world

Relations and facts

  • A relation is a triple: subject type – predicate – object type; it describes a semantic association between two entity types
  • Facts define instances of relations: they represent semantic associations between two entities

[Example: the relation person –birthPlace→ location, with birthPlace facts as edges between specific person and location entities]
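The entity/relation/fact distinction maps naturally onto simple data structures. A minimal Python sketch (class and field names are illustrative, not part of the talk):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    """A real-world object with a stable id and one or more types."""
    id: str                  # e.g. a Freebase mid such as "m.0abc12" (illustrative)
    types: frozenset         # e.g. frozenset({"person", "politician"})

@dataclass(frozen=True)
class Relation:
    """Schema-level triple: subject type - predicate - object type."""
    subject_type: str        # e.g. "person"
    predicate: str           # e.g. "birthPlace"
    object_type: str         # e.g. "location"

@dataclass(frozen=True)
class Fact:
    """Instance-level triple: an edge between two concrete entities."""
    subject: Entity
    predicate: str           # names a Relation, e.g. "birthPlace"
    obj: Entity
```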

SLIDE 12

What is a Knowledge Graph (4)

A graph that aims to describe knowledge about the real world

Entities (nodes) and facts (edges)

[Example graph with fact edges labeled birthPlace, director, spouse]

SLIDE 13

Knowledge Graphs

Sizes of existing knowledge graphs:

  • 10M entities in 350K types; 120M facts for 100 relations
  • 40M entities in 1.5K types; 650M facts for 4K relations (core of the Google Knowledge Graph)
  • 600M entities in 15K types; 20B facts
  • Knowledge Vault: 45M entities in 1.1K types; 271M facts for 4.5K relations
  • 4M entities in 250 types; 500M facts for 6K relations

[Dong16, Weikum16]

SLIDE 14

Knowledge Graphs: incompleteness

#Facts/Entities in Freebase (as of March 2016)

  • 40% of entities with no facts
  • 56% of entities with <3 facts

[Dong16] [West+14]

SLIDE 15

Knowledge Graphs: incompleteness

SLIDE 16

Wikipedia-derived Knowledge Graphs

Goal:
  • Derive a KG from Wikipedia

Source:
  • Structured components (categories, infoboxes, …)

Process:
  • Assign a type to the main entity
  • Map attributes to KG relations

Our focus – Lector:
  • Text as the source of facts
  • Encyclopedic nature (many facts)
  • Restricted community (homogeneous language)

Articles with no infobox:
  • 56% in 2008
  • 66% in 2010
SLIDE 17

Lector: Harvesting facts from text

Our purpose:
  • Enrich a KG with facts extracted from Wikipedia text

Our method:
  • We rely on the duality between phrases (spans of text between two entities) and relations (canonical relations from a KG)

Experiment – facts in the domain of people:
  • 12 Freebase relations

Result – Lector can extract more than 200K facts:
  • absent from Freebase, DBpedia and YAGO
  • many relations reach an estimated accuracy of 95%
SLIDE 18

Duality of Phrases and Relations

SLIDE 19

Duality of Patterns and Relations

Facts:
  (Michelle, Harvard), (Hillary, Yale)

Patterns:
  X studied at Y
  X graduated from Y
  X earned his degree from Y
  X was a student at Y
  X visited Y

Fact candidates:
  (Michelle, Harvard), (Alberto, PoliMi), (Hillary, Yale), (Wesley, UofTexas)

Adapted from an example by Gerhard Weikum

SLIDE 20

Duality of Patterns and Relations: an Adult Approach…

DIPRE (1998)
  • seminal work

Snowball (2000), Espresso (2006), NELL (2010), …
  • build on DIPRE

TextRunner (2007), ReVerb (2011), OLLIE (2012), …
  • Open IE: discover new relations (open)
SLIDE 21

Duality of Patterns and Relations: …with a Teenage Attitude

Facts:
  (Michelle, Harvard), (Hillary, Yale)

Patterns:
  X studied at Y
  X graduated from Y
  X earned his degree from Y
  X was a student at Y
  X visited Y
  • good for recall
  • not for precision (noisy, drifting)

Fact candidates:
  (Michelle, Harvard), (Hillary, Yale), (Divesh, RomaTre), (Alberto, PoliMi)
  (Michelle, Harvard), (Alberto, PoliMi), (Hillary, Yale), (Wesley, UofTexas)

Adapted from an example by Gerhard Weikum

SLIDE 22

With a Teenager, better to Introduce a Soft Distant Supervision

(Many) facts from the KG:
  (Michelle, Harvard), (Hillary, Yale), …

(Good) phrases from articles:
  X studied at Y
  X graduated from Y
  X earned his degree from Y
  …
  • High precision
  • No drifting

New facts:
  (Michelle, Harvard), (Hillary, Yale), (Alberto, PoliMi), …

Adapted from an example by Gerhard Weikum

SLIDE 23

Our approach

[Pipeline diagram: original articles → articles annotated with Freebase entities (en1, en2, …) → phrases between entity pairs (“was born in”, “attended”, “is a graduate of”, …) counted per pair → pairs joined with Freebase facts (almaMater, birthPlace, …) → new facts]

SLIDE 24

Annotate articles with FB entities

We rely on:
  • Wikipedia entities (highlighted in the text)
  • RDF interlinks between Wikipedia and Freebase

Wikipedia entities in an article:
  • Primary entity (the subject of the article)
  • Secondary entities (entities linked in the article)
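A minimal sketch of how the owl:sameAs interlinks might be loaded to map Wikipedia page titles to Freebase ids; the file name, URI layout, and regular expression are assumptions for illustration, not the talk's actual tooling:

```python
import re

# One hypothetical N-Triples line per interlinked entity, e.g.:
# <http://rdf.freebase.com/ns/m.0abc12> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Michelle_Obama> .
SAMEAS = re.compile(r"<[^>]*/ns/(m\.[^>]+)>\s+<[^>]*sameAs>\s+<[^>]*/resource/([^>]+)>")

def load_interlinks(path):
    """Map Wikipedia/DBpedia page titles to Freebase mids."""
    title_to_mid = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = SAMEAS.search(line)
            if match:
                mid, title = match.groups()
                title_to_mid[title] = mid
    return title_to_mid

# title_to_mid = load_interlinks("sameas_links.nt")   # hypothetical dump file
# title_to_mid.get("Michelle_Obama")  ->  "m.0abc12"  (illustrative mid)
```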

SLIDE 25

Annotate articles with FB entities

Primary entity: disambiguated by the page itself … but never linked in its own article!

We match mentions of the primary entity using:
  • Full name (Michelle Obama)
  • Last name (Obama)
  • Complete name (Michelle LaVaughn Robinson Obama)
  • Personal pronouns (She)
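A rough sketch of the primary-entity matching step; the function names and the pronoun list are illustrative assumptions, and a real implementation would also guard against other people who share the last name:

```python
import re

def primary_mention_pattern(full_name, complete_name=None, pronouns=("He", "She")):
    """Match unlinked mentions of the article's primary entity:
    full name, complete name, last name, and personal pronouns."""
    names = [n for n in (complete_name, full_name, full_name.split()[-1]) if n]
    names = sorted(set(names), key=len, reverse=True)      # longest alternative first
    parts = [rf"\b{re.escape(n)}\b" for n in names]
    parts += [rf"\b{p}\b" for p in pronouns]
    return re.compile("|".join(parts))

def annotate_primary(text, entity_id, pattern):
    """Replace every matched mention with an entity placeholder such as <Michelle_Obama>."""
    return pattern.sub(f"<{entity_id}>", text)

# pattern = primary_mention_pattern("Michelle Obama",
#                                   complete_name="Michelle LaVaughn Robinson Obama")
# annotate_primary("She graduated from Princeton.", "Michelle_Obama", pattern)
# -> "<Michelle_Obama> graduated from Princeton."
```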

SLIDE 26

Annotate articles with FB entities

Secondary entities: disambiguated by wiki-links … but only the first occurrence is linked!

We match secondary entities using:
  • Anchor text (University of Chicago Medical Center)
  • Wikipedia id (University of Chicago)
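A sketch of the corresponding step for secondary entities, propagating each wiki-link to later, unlinked occurrences of the same anchor text (hypothetical helper, not the talk's code):

```python
def annotate_secondary(text, wiki_links):
    """wiki_links: {anchor_text: wikipedia_id}, collected from the article's explicit
    links (which Wikipedia style adds only at the first occurrence)."""
    # Longest anchors first, so "University of Chicago Medical Center" is not
    # shadowed by "University of Chicago".
    for anchor in sorted(wiki_links, key=len, reverse=True):
        text = text.replace(anchor, f"<{wiki_links[anchor]}>")
    return text

# annotate_secondary("... she worked at the University of Chicago Medical Center ...",
#                    {"University of Chicago Medical Center": "University_of_Chicago_Medical_Center"})
```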

SLIDE 27

Our approach

[Pipeline diagram (repeated): original articles → articles annotated with Freebase entities → phrases between entity pairs (“was born in”, “attended”, “is a graduate of”, …) → pairs joined with Freebase facts (almaMater, birthPlace, …)]

SLIDE 28

Extracting phrases

For each sentence (containing en1 and en2) in all the articles:
  1. extract the span of text between en1 and en2
  2. generalize it (G) and check whether it is relational (R)
  3. if it is, associate it with all the relations that link en1 to en2 in the KG

Generalizing phrases (G):
  “was the first”, “was the 41st” → “was the ORD”
  “is an American”, “is a Canadian” → “is a NAT”

Filtering relational phrases (R):
  • conform with POS-level patterns [Mesquita+13]
  “is married to” → [VB], [VB], [TO] → relational
  “together with” → [RP], [IN] → not relational
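A simplified sketch of steps 1–3, assuming sentences already annotated with entity placeholders and POS tags; the generalization rules and the relational test below are rough stand-ins for the POS-level patterns of [Mesquita+13]:

```python
import re

ORDINAL = re.compile(r"\b(\d+(st|nd|rd|th)|first|second|third)\b")
NATIONALITY = re.compile(r"\b(American|Canadian|Italian|German|French)\b")  # illustrative list

def generalize(phrase):
    """Collapse surface variation: 'was the 41st' -> 'was the ORD', 'is an American' -> 'is an NAT'."""
    return NATIONALITY.sub("NAT", ORDINAL.sub("ORD", phrase))

def is_relational(pos_tags):
    """Stand-in for the POS-level patterns of [Mesquita+13]: keep phrases built
    around a verb; drop bare particles/prepositions such as 'together with'."""
    return any(tag.startswith("VB") for tag in pos_tags)

def harvest(span_text, span_pos_tags, en1, en2, kg_facts):
    """Return (generalized phrase, relation) labels for the span between en1 and en2.

    kg_facts: {(subject_id, object_id): set_of_relations} loaded from Freebase.
    """
    phrase = generalize(span_text)
    if not is_relational(span_pos_tags):
        return []
    return [(phrase, rel) for rel in kg_facts.get((en1, en2), set())]
```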

SLIDE 29

Extracting phrases (cont’d)

Considering only the witness count is not reliable: “was born in” co-occurs with both birthPlace and deathPlace facts.

For each relation, we rank the phrases by scoring the specificity of a phrase (q) with a relation (s_j):

  score(s_j, q) = P(s_j | q) · ln c(q, s_j)

where:
  • c(q, s_j) is the number of times phrase q witnesses a fact of relation s_j
  • P(s_j | q) > 0.5 is the minimum probability threshold
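The exact scoring formula did not survive extraction, but the score column of the place-of-birth ranking table (slide 40) is consistent with score(s_j, q) = P(s_j | q) · ln c(q, s_j). A sketch under that assumption, with the P(s_j | q) > 0.5 filter and a top-K cut per relation:

```python
import math
from collections import Counter, defaultdict

def rank_phrases(cooccurrence, k=20, min_prob=0.5):
    """cooccurrence: Counter over (phrase, relation) witness counts c(q, s_j).

    Assumed score (consistent with the place_of_birth ranking table):
        score(s_j, q) = P(s_j | q) * ln c(q, s_j)
    Phrases with P(s_j | q) <= min_prob are filtered; the top-k are kept per relation.
    """
    phrase_totals = Counter()
    for (phrase, _), count in cooccurrence.items():
        phrase_totals[phrase] += count

    ranked = defaultdict(list)
    for (phrase, relation), count in cooccurrence.items():
        prob = count / phrase_totals[phrase]          # P(s_j | q)
        if prob > min_prob:
            ranked[relation].append((prob * math.log(count), phrase))
    return {rel: sorted(pairs, reverse=True)[:k] for rel, pairs in ranked.items()}

# rank_phrases(Counter({("was born in", "birthPlace"): 106200,
#                       ("was born in", "deathPlace"): 39000}))   # illustrative counts
```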
SLIDE 30

Our approach

[Pipeline diagram (repeated): original articles → annotated articles → phrases counted per entity pair → pairs joined with Freebase facts (almaMater, birthPlace, …) → new facts]

SLIDE 31

Experiments

  • 12 Freebase relations in the domain of people:
    people/person/place_of_birth, people/person/place_of_death, people/person/nationality, sports/pro_athlete/teams, people/person/education, people/person/spouse, people/person/parents, people/person/children, people/person/ethnicity, people/person/religion, award/award_winner/awards_won, government/politician/party
  • K = 20: maximum number of phrases for each relation
  • 977K person entities (interlinked in multiple KGs)

Aim of the experiment:
  • Quantify the number of facts extracted by Lector (not in Freebase)
  • Estimate the accuracy of the facts:
    • manual evaluation of a random sample (1,800 extracted facts)
    • precision estimated with the Wilson score interval (C.L. = 95%)
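Precision is reported with a 95% Wilson score interval. A small sketch of that computation (the example numbers in the comment are illustrative, not the paper's):

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for an observed accuracy of correct/n (z=1.96 for 95% C.L.)."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin, centre + margin)

# wilson_interval(337, 347)  ->  roughly (0.948, 0.984)   # illustrative counts
```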
SLIDE 32

Lector new facts

Freebase relation                # facts already in Freebase   extracted by Lector (not yet in FB)   evaluated facts
people/person/place_of_birth     662,192                       57,140                                347
people/person/place_of_death     178,849                       18,458                                104
people/person/nationality        584,792                       50,234                                290
sports/pro_athlete/teams         145,080                       49,809                                286
people/person/education          378,043                       46,342                                286
people/person/spouse             130,425                       14,939                                 97
people/person/parents            123,747                        5,648                                 50
people/person/children           141,860                        3,149                                 50
people/person/ethnicity           39,869                        2,989                                 50
people/person/religion            47,016                        1,437                                 50
award/award_winner/awards_won     98,625                        1,934                                 50
government/politician/party       65,300                        3,684                                 50

All the numbers are computed over the 977K person entities from RDF interlinks (owl:sameAs).

SLIDE 33

Limitations

Ambiguous phrases:
  • ../spouse: “met”
  • ../children: “was succeeded by”
  • ../place_of_birth: “grew up in”

Removing the ambiguous phrase:
  (+) accuracy: 97.24% ± 1.49%
  (-) extracted facts: 57K → 50K (-8%)

Impact of K (number of phrases per relation):
  • We try different values K ∈ {1, 5, 10, 15, 20}
  • Ground truth: 1,800 manually evaluated facts

[Chart: accuracy as a function of K, for K = 1, 5, 10, 15, 20]

SLIDE 34

… and in other KGs?

Relation (DBpedia / YAGO)      extracted by Lector (not yet in FB)   not in DBpedia   not in YAGO
birthPlace / wasBornIn         57,140                                48,314           55,577
deathPlace / diedIn            18,458                                15,818           18,014
nationality / isCitizenOf      50,234                                48,125           49,977
team / playsFor                49,809                                23,640           35,013
almaMater / graduatedFrom      46,342                                45,585           46,095
spouse / isMarriedTo           14,939                                14,662           14,573
parent / –                      5,648                                 5,631           –
child / hasChild                3,149                                 3,140            2,958
ethnicity / –                   2,989                                 2,890           –
religion / –                    1,437                                 1,368           –
award / hasWonPrize             1,934                                 1,655            1,370
party / isPoliticianOf          3,684                                 3,594            3,684

SLIDE 35

Conclusions

Future work:
  • Introduce negative counts to filter ambiguous phrases
  • Extend and generalize the process to other relations

SLIDE 36

Questions?

All the facts produced are available for download at: http://dx.doi.org/10.7939/DVN/10795

SLIDE 37

Extracting phrases (cont’d)

[Bar charts for the birthPlace and deathPlace relations, comparing phrases ranked by raw witness count with phrases ranked by specificity score: by raw count, generic phrases such as “died in”, “was born in”, “moved to”, “returned to”, “is a” dominate; after scoring, the top phrases are relation-specific, e.g. “was born in”, “was born at”, “is a native of”, “born in” for birthPlace and “died in”, “died at”, “retired to”, “was assassinated in”, “was killed in” for deathPlace]

slide-38
SLIDE 38

Improve phrase extraction

We “normalize” lists of entities introduced by Hearst-style “such as” patterns:

  <Ronaldo> played for many teams such as <FCBarcelona>, <Real_Madrid> and <InterFC>

becomes:

  <Ronaldo> played for <FCBarcelona>
  <Ronaldo> played for <Real_Madrid>
  <Ronaldo> played for <InterFC>
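A simplified sketch of the list normalization on already-annotated sentences; it keeps the clause before “such as” as-is (so the class noun phrase, e.g. “many teams”, survives), which is a simplification of the slide's rewriting:

```python
import re

ENTITY = re.compile(r"<[^>]+>")

def normalize_such_as(sentence):
    """Rewrite 'X ... such as <A>, <B> and <C>' into one clause per listed entity."""
    if " such as " not in sentence:
        return [sentence]
    head, tail = sentence.split(" such as ", 1)
    items = ENTITY.findall(tail)
    return [f"{head.strip()} {item}" for item in items]

# normalize_such_as("<Ronaldo> played for many teams such as <FCBarcelona>, <Real_Madrid> and <InterFC>")
# -> ["<Ronaldo> played for many teams <FCBarcelona>",
#     "<Ronaldo> played for many teams <Real_Madrid>",
#     "<Ronaldo> played for many teams <InterFC>"]
```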

SLIDE 39

Improve phrase extraction

To improve accuracy, check around!

  <Alice>, the sister of <Bob>, is married with <Charlie>
  <Alice> is married with <Bob>’s brother

To improve recall, find subordinate clauses!

  <Ronaldo> played for <FCBarcelona> and then moved to <InterFC>
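The coordination example could be handled with a very rough split of the annotated sentence around entity mentions; this is only a sketch of the idea (a real implementation would use a dependency parse), and all names here are hypothetical:

```python
import re

ENTITY = re.compile(r"<[^>]+>")

def split_coordinations(sentence):
    """'<X> played for <Y> and then moved to <Z>' ->
    [('<X>', 'played for', '<Y>'), ('<X>', 'moved to', '<Z>')].
    Simplification: the first entity is treated as the shared subject."""
    entities = ENTITY.findall(sentence)
    if len(entities) < 2:
        return []
    subject = entities[0]
    between = ENTITY.split(sentence)[1:]        # text following each entity mention
    triples = []
    for obj, span in zip(entities[1:], between):
        phrase = re.sub(r"^\s*((?:and|then|,)\s+)+", "", span).strip()
        triples.append((subject, phrase, obj))
    return triples

# split_coordinations("<Ronaldo> played for <FCBarcelona> and then moved to <InterFC>")
# -> [('<Ronaldo>', 'played for', '<FCBarcelona>'), ('<Ronaldo>', 'moved to', '<InterFC>')]
```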

SLIDE 40

Place of birth ranking

phrase               c(q, s_j)   P(s_j | q)   score(s_j, q)
was born in            106,200      0.73          8.43
was born at              5,606      0.86          7.43
is a native of             445      0.72          4.43
grew up in               2,399      0.54          4.22
was born on                165      0.81          4.15
born in                    431      0.68          4.13
was a native of            440      0.65          4.00
is originally from         163      0.70          3.60
hails from                 149      0.68          3.40
is a                     4,890      0.07          0.61   (filtered)
returned to              3,459      0.21          1.72   (filtered)
died in                  2,108      0.11          0.79   (filtered)
was raised in              615      0.46          3.01   (filtered)

Only the top-k phrases above the P(s_j | q) > 0.5 cut are kept.

SLIDE 41

Knowledge Graphs: Semantic Search

SLIDE 42

Knowledge Graphs: Semantic Search