Approaches Towards Unified Models for Integrating Web Knowledge Bases



slide-1
SLIDE 1

Approaches Towards Unified Models for Integrating Web Knowledge Bases Maria Koutraki

Joint work with: Nicoleta Preda, Dan Vodislav Paris, 26/10/2016

slide-2
SLIDE 2

Koutraki Maria


slide-3
SLIDE 3

Motivation – Unstructured Data


slide-4
SLIDE 4

Motivation – Unstructured Data


  • Text representation
  • Lack of structure
  • No entity resolution
  • No entity disambiguation
slide-5
SLIDE 5

Motivation – Structured Data


What is structured data?

  • RDF – Resource Description Framework
  • W3C standard for describing web resources
  • Triple = statement of the form (subject, property, object)

Subject      Property      Object
Rodin        type          Artist
Artist       interestedIn  Sculpture
Rodin        notableWork   The Thinker
The Thinker  type          Sculpture
Rodin        influences    Artist1
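As a minimal sketch, the triples listed on this slide can be held as plain (subject, property, object) tuples. The `Triple` type and the `objects_of` helper are illustrative names, not part of RDF or of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: (subject, property, object)."""
    subject: str
    prop: str
    obj: str


# The five statements shown on the slide.
TRIPLES = [
    Triple("Rodin", "type", "Artist"),
    Triple("Artist", "interestedIn", "Sculpture"),
    Triple("Rodin", "notableWork", "The Thinker"),
    Triple("The Thinker", "type", "Sculpture"),
    Triple("Rodin", "influences", "Artist1"),
]


def objects_of(triples, subject, prop):
    """Return every object o such that (subject, prop, o) is in the list."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]
```

For example, `objects_of(TRIPLES, "Rodin", "notableWork")` looks up the works attached to Rodin.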

slide-6
SLIDE 6

Motivation – Structured Data


Linked Open Data Cloud

Domains

Topic                    %
Government               18.05%
Publications             9.47%
Life Sciences            8.19%
User-generated content   4.73%
Cross-domain             4.04%
Media                    2.17%
Geographic               2.07%
Social Web               51.28%

  • Exponential increase of datasets and triples

  • > 30 billion triples
  • Automatically constructed KBs
slide-7
SLIDE 7

Motivation – Structured Data


[Figure: RDF graphs from two sources. Museum_Rodin: edge labels createdBy, style, date; node labels 1902, bronze. DBpedia: edge labels born, type, influences; node labels 1840, sculpturer, Artist1, Artist2.]

slide-8
SLIDE 8

Motivation – Structured Data


[Figure: the DBpedia graph (edge labels born, type, influences; node labels 1840, sculpturer, Artist1, Artist2) next to the Museum_Rodin graph (edge labels createdBy, style, date; node labels 1902, bronze).]

Complementary

slide-9
SLIDE 9

Motivation – Structured Data


[Figure: RDF graphs from two sources. Museum_Rodin: edge labels createdBy, style, date; node labels 1902, bronze. Freebase: edge labels type, mentor; node labels sculpturer, Artist_3, Artist_4, Artist_5, Artist_6.]

slide-10
SLIDE 10

Motivation – Structured Data


[Figure: RDF graphs from two sources. Museum_Rodin: edge labels createdBy, style, date; node labels 1902, bronze. Freebase: edge labels type, mentor; node labels sculpturer, Artist_3, Artist_4, Artist_5, Artist_6.]

slide-11
SLIDE 11

Motivation – Structured Data


Diverse schemas for representation in LOD

  • ~576 schemas/vocabularies used for representation
  • Diverse quality of schemas [1]
  • Duplicate representation of similar concepts/classes and relations
  • Lack of explicit alignment between classes/relations (with only up to 2%) [2]

[1] Aimilia Magkanaraki, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis: Benchmarking RDF Schemas for the Semantic Web. International Semantic Web Conference 2002: 132-146
[2] Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. International Semantic Web Conference (1) 2014: 245-260

slide-12
SLIDE 12

Motivation – Web services


slide-13
SLIDE 13

Motivation – Web services


[Figure: Museum_Rodin graph. Edge labels: createdBy, style, date. Node labels: 1902, bronze.]

slide-14
SLIDE 14

Motivation – Web services


[Figure: Museum_Rodin graph (edge labels createdBy, style, date; node labels 1902, bronze) and DBpedia graph (edge labels contains, style; node labels sculpture, bronze), with an owl:sameAs link.]
slide-15
SLIDE 15

Motivation – Web services


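[Figure: Museum_Rodin graph (edge labels createdBy, style, date; node labels 1902, bronze) and DBpedia graph (edge labels contains, style; node labels sculpture, bronze), with an owl:sameAs link, plus the result of the Web service call MuseumExhibitions(Paris):]

<exhibitions> <museum>Louvre</museum> <museum>Rodin</museum> </exhibitions>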
slide-16
SLIDE 16

Motivation – Web services


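[Figure: Museum_Rodin graph (edge labels createdBy, style, date; node labels 1902, bronze) and DBpedia graph (edge labels contains, style; node labels sculpture, bronze), plus the result of the Web service call MuseumExhibitions(Paris):]

<exhibitions> <museum>Louvre</museum> <museum>Rodin</museum> </exhibitions>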

slide-17
SLIDE 17

Motivation – Web services

More than 12000 APIs* from various domains:

  • Search (3200 APIs)
  • Social (3000 APIs)
  • Traveling (1200 APIs)
  • Music (1000 APIs)
  • Financial (1200 APIs), Science (600 APIs), Weather (300 APIs)

*Source: ProgrammableWeb.com


slide-18
SLIDE 18

Context & Objectives

  • PART I – DORIS: Deriving Intensional Descriptions for Web Services
  • PART II – SOFYA: Online Relation Alignment on Linked Datasets

[Figure: architecture overview. Labels for DORIS: Web Service, Knowledge Base. Labels for SOFYA: two SPARQL endpoints, two Knowledge Bases.]

slide-19
SLIDE 19

Part I: Deriving Intensional Descriptions for Web Services


[CIKM’15, ISWC’15, BDA’15]

slide-20
SLIDE 20

Web Services

What is a Web service?

  • A way of publishing/exporting data
  • A Web service (WS) is a function
  • We consider WSs implementing REST: interfaces to data sources
  • To call a WS, we need:
      • the URL address of the WS
      • an input value

Example: “get artworks by artist name” – exported by DORIS_museums

  • Call for input “Rodin”: http://doris_museums.com?artist=Rodin
  • Output: XML document
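A minimal sketch of how such a call URL could be assembled, assuming the base address and the `artist` parameter from the example above. `build_call_url` is a hypothetical helper, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Base address and parameter name are taken from the slide's example.
BASE_URL = "http://doris_museums.com"


def build_call_url(base_url: str, **inputs: str) -> str:
    """Return the REST call URL for the given input values (URL-encoded)."""
    return f"{base_url}?{urlencode(inputs)}"


url = build_call_url(BASE_URL, artist="Rodin")
```

Encoding the input value matters as soon as it contains spaces or other reserved characters, for instance a full name such as "Auguste Rodin".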

slide-21
SLIDE 21

Objective


Uniform access to Web services! Local-as-view approach:

  • We consider a given knowledge base (RDF) as the target source
  • Infer a mapping function (transforms XML call results → RDF)
  • Infer a description (a parameterized query over the target KB)

[Figure labels: Web Services, Web Service, Knowledge Base.]

slide-22
SLIDE 22

Mapping function (σ)

Web service: “get artworks by artist”

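R = getArtWorksByArtist(Rodin) is the WS call result (XML); σ(R) is the corresponding KB fragment (RDF).

[Figure, left: XML tree of the call result. The root has two item nodes; each item carries the labels t, d, a, b, n, with the values (The Thinker, 1902, 1840, Rodin) and (The Kiss, 1889, 1840, Rodin).]
[Figure, right: RDF fragment with nodes URI1 to URI5. Edge labels: name, birthdate, date, works, shownAt. Literals: Rodin, 1840, The Thinker, 1902, The Kiss, 1889.]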
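The following is a rough sketch of what a mapping function σ could look like for this example, not the script DORIS generates. The tags t, d, a, b, n and the properties name, date, birthdate, works come from the slide; the nesting of b and n under a, the direction of the works edge, and the placeholder identifiers `artist1`, `artwork1`, ... are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the call result: tag names are from the slide,
# the nesting (b and n below a) is a guess.
CALL_RESULT = """
<root>
  <item><t>The Thinker</t><d>1902</d><a><b>1840</b><n>Rodin</n></a></item>
  <item><t>The Kiss</t><d>1889</d><a><b>1840</b><n>Rodin</n></a></item>
</root>
"""


def sigma(xml_text):
    """Map an XML call result to a set of (subject, property, object) triples."""
    triples = set()
    artists = {}  # artist name -> placeholder id, so repeated artists share a node
    for i, item in enumerate(ET.fromstring(xml_text).iter("item"), start=1):
        artwork = f"artwork{i}"  # hypothetical identifier for the artwork node
        name = item.findtext("a/n")
        artist = artists.setdefault(name, f"artist{len(artists) + 1}")
        triples.add((artist, "name", name))
        triples.add((artist, "birthdate", item.findtext("a/b")))
        triples.add((artist, "works", artwork))
        triples.add((artwork, "name", item.findtext("t")))
        triples.add((artwork, "date", item.findtext("d")))
    return triples


KB_FRAGMENT = sigma(CALL_RESULT)
```

The shownAt edges of the figure are left out because the XML sample contains no value they could be derived from.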

slide-23
SLIDE 23

Parameterized Query

Schema of the parameterized query: the KB schema


[Figure: the RDF fragment σ(getArtworksByArtist(Rodin)), with nodes URI1 to URI5, edge labels name, birthdate, date, works, shownAt, and literals Rodin, 1840, The Thinker, 1902, The Kiss, 1889.]

slide-24
SLIDE 24

Parameterized Query

Schema of the parameterized query: the KB schema

[Figure: the parameterized query σ(getArtworksByArtist(?IO)), drawn as a graph pattern over the variables ?x, ?y, ?z, ?IO, ?l1, ?l3, ?l4 with edge labels name, birthdate, date, works, shownAt, next to the RDF fragment σ(getArtworksByArtist(Rodin)) with nodes URI1 to URI5.]
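To illustrate how a parameterized query relates to the KB fragment, here is a small sketch that evaluates a list of triple patterns against a list of triples once ?IO is bound to the input. The roles given to the variables, the omission of ?z and shownAt, and the assignment of URI1, URI3, URI5 to Rodin and the two artworks are assumptions; the figure does not settle them.

```python
def match(patterns, triples, binding):
    """Yield every extension of `binding` that satisfies all triple patterns."""
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep one value across the whole query.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# Assumed reading of the query graph for getArtworksByArtist(?IO).
QUERY = [
    ("?x", "name", "?IO"),
    ("?x", "birthdate", "?l1"),
    ("?x", "works", "?y"),
    ("?y", "name", "?l3"),
    ("?y", "date", "?l4"),
]

# Assumed layout of the KB fragment from the figure.
KB = [
    ("URI1", "name", "Rodin"),
    ("URI1", "birthdate", "1840"),
    ("URI1", "works", "URI3"),
    ("URI1", "works", "URI5"),
    ("URI3", "name", "The Thinker"),
    ("URI3", "date", "1902"),
    ("URI5", "name", "The Kiss"),
    ("URI5", "date", "1889"),
]

answers = list(match(QUERY, KB, {"?IO": "Rodin"}))
```

Each answer binds ?l3 and ?l4 to the title and date of one artwork, which is the information the Web service returns for the same input.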

slide-25
SLIDE 25

Parameterized Query

Schema of the parameterized query: the KB schema


[Figure: the RDF fragment σ(getArtworksByArtist(Rodin)), with nodes URI1 to URI5, edge labels name, birthdate, date, works, shownAt, and literals Rodin, 1840, The Thinker, 1902, The Kiss, 1889.]

slide-26
SLIDE 26

Overview – DORIS system

Input:
  1. Web service
  2. Knowledge Base

Output:
  1. Mapping Function
  2. Parameterized Query

Instance-based solution

1. Probing

  • Call WS with top entities from KB
  • Obtain call results (samples)

2. Compute alignments between WS and KB

  • Path Alignments
  • Class/Relation Alignments

slide-27
SLIDE 27

Path Alignments

  • The WS call result is relevant to an input entity (Rodin)
  • Leaf nodes in the call result encode attributes of the input entity
  • Linear XML paths in the WS call result correspond to KB paths from the input entity to a literal

[Figure, left: XML tree of getArtWorksByArtist(Rodin). The root has two item nodes; each item carries the labels t, d, a, b, n, with the values (The Thinker, 1902, 1840, Rodin) and (The Kiss, 1889, 1840, Rodin).]
[Figure, right: YAGO fragment for Rodin. Nodes: yago:Rodin, yago:The_Thinker, yago:Rodin_Museum, yago:Pantheon. Edge labels: name, birthdate, date, works, shownAt. Literals: Rodin, 1840, The Thinker, 1902.]

slide-28
SLIDE 28

Path Alignments

Path Pairs:

[Figure: the XML tree of getArtWorksByArtist(Rodin) and the YAGO fragment for Rodin (nodes yago:Rodin, yago:The_Thinker, yago:Rodin_Museum, yago:Pantheon), with one pair of paths drawn out. XML path labels: root, item, t. KB path labels, starting at the KB input entity: shownAt, works, name.]

slide-29
SLIDE 29

Metrics for Path Alignments

1. Overlap: align two paths if the results of one overlap the results of the other above a threshold α.

#x: number of samples

2. Inclusion: align two paths if the results of one are included in the results of the other above a threshold α.

  • Compute inclusions both ways: KB path ⇆ WS path
  • Partial completeness assumption: “a source knows either all or none of the p-attributes of some x”
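The formulas for these metrics did not survive in the transcript, so the following is only one plausible reading, stated as an assumption: both scores are computed per sample entity x and averaged over the samples. The function names, the sample data, and the value of `ALPHA` are hypothetical.

```python
def overlap_score(ws_values, kb_values):
    """Fraction of sample entities whose WS and KB value sets share a value."""
    samples = ws_values.keys() & kb_values.keys()
    if not samples:
        return 0.0
    hits = sum(1 for x in samples if ws_values[x] & kb_values[x])
    return hits / len(samples)


def inclusion_score(source, target):
    """Fraction of entities whose `source` values are all found in `target`.

    Entities for which `target` has no value at all are skipped, which is how
    this sketch reads the partial completeness assumption.
    """
    considered = [x for x in source if source[x] and target.get(x)]
    if not considered:
        return 0.0
    hits = sum(1 for x in considered if source[x] <= target[x])
    return hits / len(considered)


ALPHA = 0.5  # hypothetical threshold

# Made-up samples: values reached by one WS path and one KB path per entity.
ws = {"e1": {"v1", "v2"}, "e2": {"v3"}}
kb = {"e1": {"v1", "v2", "v4"}, "e2": set()}

aligned_by_inclusion = inclusion_score(ws, kb) >= ALPHA
```

Computing `inclusion_score` in both directions, as the slide asks, shows whether the WS path is contained in the KB path, the reverse, or both.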

slide-30
SLIDE 30

Class & Relation Alignments

  • Idea: starting from the right-most side, align functional sub-paths (paths selecting one value)
  • Assumption: the XML call result encodes at least one functional property per class of entities

Problem: identify XML nodes representing entities

[Figure: the XML path (labels root, item, t) and the KB path from the KB input entity (labels shownAt, works, name), each edge annotated with a 1 or n cardinality.]

→ “item” nodes correspond to artworks

slide-31
SLIDE 31

Class & Relation Alignments

  • Idea: starting from the right-most side, align functional sub-paths (paths selecting one value)
  • Assumption: the XML call result encodes at least one functional property per class of entities

Problem: identify XML nodes representing entities

[Figure: the XML path (labels root, item, t) and the KB path from the KB input entity (labels shownAt, works, name), each edge annotated with a 1 or n cardinality.]

→ “item” nodes correspond to artworks

slide-32
SLIDE 32

Class & Relation Alignments

Compute Functionality

  • KB: “A relation r(x,y) is called functional if for each x there is not more than one y.”
  • XML: “A path is functional if there are no two sibling nodes sharing the same label.”
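Both definitions can be checked mechanically. The sketch below is one reading of them, under the assumption that an XML path is given as a list of tags followed from a start node; the function names and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET


def is_functional_relation(facts):
    """True if no subject x is paired with more than one object y."""
    objects = {}
    for x, y in facts:
        objects.setdefault(x, set()).add(y)
    return all(len(ys) <= 1 for ys in objects.values())


def is_functional_path(start, tags):
    """True if following `tags` from `start` never meets two siblings with the same label."""
    nodes = [start]
    for tag in tags:
        next_nodes = []
        for node in nodes:
            children = node.findall(tag)
            if len(children) > 1:
                return False
            next_nodes.extend(children)
        nodes = next_nodes
    return True


# Assumed call result: two item nodes below the root, one title each.
DOC = ET.fromstring(
    "<root>"
    "<item><t>The Thinker</t><a><n>Rodin</n></a></item>"
    "<item><t>The Kiss</t><a><n>Rodin</n></a></item>"
    "</root>"
)
```

Here the path from the root to t is not functional because the root has two item children, while the path t below a single item is.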

slide-33
SLIDE 33

Overview

[Figure: pipeline. Input (1. Web service, 2. Knowledge Base) → DORIS → output (1. Mapping Function, 2. Parameterized Query). Additional label: Discovering I/O Dependencies.]

slide-34
SLIDE 34

Discovering I/O Dependencies

getArtworksByArtist (input: Auguste Rodin)

  • The Thinker, ID_THE_THINKER
  • The Kiss, ID_THE_KISS

getArtworkByArtworkID (input: ID_THE_THINKER)

  • 1.96 m
  • Bronze

getArtworkByArtworkID (input: ID_THE_KISS)

  • 1.81 m
  • Bronze

Join the output from the two calls

slide-35
SLIDE 35

Discovering I/O Dependencies

  • Discover “hidden” input types for Web services in the outputs of mapped (solved) Web services

Example: the artworkID output of getArtworksByArtist is the input of getArtworkByArtworkID.
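A small sketch of such a dependency, with both services replaced by stubs that return only the values shown on the slides. The function names mirror the service names; the stubs, and the choice to return plain tuples, are assumptions for illustration.

```python
def get_artworks_by_artist(artist):
    """Stub for the first service: (title, artworkID) pairs for an artist."""
    data = {
        "Auguste Rodin": [
            ("The Thinker", "ID_THE_THINKER"),
            ("The Kiss", "ID_THE_KISS"),
        ]
    }
    return data.get(artist, [])


def get_artwork_by_artwork_id(artwork_id):
    """Stub for the second service: the two detail values shown for an artworkID."""
    data = {
        "ID_THE_THINKER": ("1.96 m", "Bronze"),
        "ID_THE_KISS": ("1.81 m", "Bronze"),
    }
    return data.get(artwork_id)


def artworks_with_details(artist):
    """Join the two calls: feed each artworkID output into the second service."""
    rows = []
    for title, artwork_id in get_artworks_by_artist(artist):
        details = get_artwork_by_artwork_id(artwork_id)
        if details is not None:
            rows.append((title, *details))
    return rows
```

Calling `artworks_with_details("Auguste Rodin")` yields one joined row per artwork.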

slide-36
SLIDE 36

Experimental Setup - Results

  • 3 KBs tested (YAGO, DBpedia, BNF)
  • > 50 Web services (music, movies, books, geodata)
  • → High precision and recall

Summary of the class/relation alignment experiments:

           Precision             Recall
           Classes   Relations   Classes   Relations
YAGO       0.92      0.91        0.96      0.93
DBpedia    0.91      0.92        0.98      0.95
BNF*       1         1           1         1

*Tested only with WSs from the “Books” domain

slide-37
SLIDE 37

Evaluation Results

  • Path Alignment
  • Music Domain: 25 Web services

[Figure: six plots, precision and recall for each of Overlap, KB → WS, and WS → KB; x-axis ticks 1 to 11, y-axis ticks 0.2 to 1.]

  • More results: http://oasis.prism.uvsq.fr/doris/index.html

slide-38
SLIDE 38

Conclusions - DORIS

  • We proposed DORIS, a system that provides a formal description of the output of a Web service in terms of a global schema.
  • We provide a transformation function, as a script, that transforms the output of the Web service into the terms of the global schema.
  • We proposed an algorithm that discovers I/O dependencies between Web services of the same API.

slide-39
SLIDE 39

Part II: Online Relation Alignment on Linked Datasets


[EDBT’16]

slide-40
SLIDE 40

Approach: Online Relation Alignment

  • Goal: compute one-to-one relation alignments
      • Equivalence or subsumption
  • Align KBs published through SPARQL endpoints
  • The entities of the two KBs are aligned via sameAs links
  • Approach:
      • Instance-based
      • Supervised model (features computed on KB instances)
      • Sample a minimal set of entities to perform the alignment process

slide-41
SLIDE 41

Approach: Outline

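[Figure: two knowledge bases, KBS and KBT, each accessed through a SPARQL endpoint. A pair (x, y) related by rS in KBS is linked through sameAs to a pair (x', y') in KBT. Labels: rS, rT, rT--; step markers 1, 2, 3.]

Candidates for alignment:

  rS ⊆ rT1
  rS ⊆ rT2
  rS ⊆ rT3
  …

Classify the alignments:

  rS ⊆ rT1 (correct)
  rS ⊆ rT2 (incorrect)
  rS ⊆ rT3 (correct)
  …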
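As an illustration of the candidate step, the sketch below collects every target relation that holds between the sameAs images of at least one rS fact. It works on in-memory triple lists rather than SPARQL endpoints, and the function name and toy data are assumptions.

```python
def candidate_relations(r_s, kb_s, kb_t, same_as):
    """Return target relations sharing at least one fact with r_s via sameAs links."""
    candidates = set()
    for s, p, o in kb_s:
        if p != r_s:
            continue
        s_t, o_t = same_as.get(s), same_as.get(o)
        if s_t is None or o_t is None:
            continue  # no sameAs link, so the fact cannot be compared
        for s2, p2, o2 in kb_t:
            if s2 == s_t and o2 == o_t:
                candidates.add(p2)
    return candidates


# Toy data following the figure's naming: x, y in KBS and x', y' in KBT.
KB_S = [("x", "rS", "y")]
KB_T = [("x'", "rT1", "y'"), ("x'", "rT2", "z'")]
SAME_AS = {"x": "x'", "y": "y'"}

candidates = candidate_relations("rS", KB_S, KB_T, SAME_AS)
```

Each returned relation rT gives one candidate rS ⊆ rT, which the classifier then labels as correct or incorrect.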

3

slide-42
SLIDE 42

Approach: Features

Feature groups (used as matchers):

  • Inductive Logic Programming (ILP)
  • General Statistics (GS)
  • Lexical

slide-43
SLIDE 43

Features – ILP: CWA & PCA

  • Closed world assumption (cwa): for a relation r, the KB contains all the facts.
      • Good precision, bad recall
      • Absent data is treated as counter-examples
  • Partial completeness assumption (pca): for a subject x and relation r, the KB contains either all or none of the facts.
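The two assumptions lead to two ways of scoring a candidate rS ⊆ rT. The sketch below follows the usual ILP-style confidence definitions rather than formulas shown on the slides, and assumes the facts of both relations have already been mapped to shared identifiers through sameAs. Function names and data are illustrative.

```python
def cwa_confidence(r_s_facts, r_t_facts):
    """Share of rS facts also present in rT; every missing fact is a counter-example."""
    if not r_s_facts:
        return 0.0
    return len(r_s_facts & r_t_facts) / len(r_s_facts)


def pca_confidence(r_s_facts, r_t_facts):
    """Like cwa, but a missing fact only counts if rT knows the subject at all."""
    known_subjects = {x for x, _ in r_t_facts}
    judged = {(x, y) for x, y in r_s_facts if x in known_subjects}
    if not judged:
        return 0.0
    return len(judged & r_t_facts) / len(judged)


# Made-up facts: rT knows subject "a" but nothing about subject "c".
R_S = {("a", "b1"), ("a", "b2"), ("c", "d")}
R_T = {("a", "b1")}
```

Under cwa all three rS facts are judged and one is confirmed; under pca the fact about "c" is set aside, so one of two judged facts is confirmed.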

slide-44
SLIDE 44

Features – ILP: CWA & PCA

Example 1

[Figure: facts in KBS and KBT for rS: created and rT: knownFor. Node labels: The_Thinker, b2, b3. Edge labels: created, knownFor.]

slide-45
SLIDE 45

Features – ILP: CWA & PCA

Example 2

[Figure: facts in KBS and KBT for rS: created and rT: knownFor. Node labels: The_Thinker, b2, b3, c1, c2. Edge labels: created, knownFor.]

slide-46
SLIDE 46

Features – Relation Functionality

  • Functionality: “A relation r(x,y) is called functional if for each x there is not more than one y.”
  • If rS is subsumed by rT, the functionality should be higher
  • Target relations should have better coverage of facts

slide-47
SLIDE 47

Features - ILP: PIA

  • Partial completeness assumption (pca)
      • Good performance for functional relations
      • Penalizes non-functional relations
  • Proposal: partial incompleteness assumption (pia)
      • The more important a counter-example is, the more it should count!

slide-48
SLIDE 48

Features – GS: Type similarity

  • Check the similarity of the type distributions of relations rS and rT.
  • A weighted Jaccard similarity metric assesses whether the two relations have a similar structure in terms of types.
  • High similarity is a good indicator of equivalence/subsumption between relations.

Example: the type distributions of rT: hasWriter and rS: hasCreator are (Book 30%, Movie 20%, …) and (Book 20%, Movie 30%, …).

High similarity!!
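A minimal sketch of a weighted Jaccard similarity between two type distributions, using the standard sum-of-minima over sum-of-maxima form. Whether SOFYA uses exactly this variant is an assumption, and the two dictionaries contain only the percentages visible on the slide, with the elided types left out.

```python
def weighted_jaccard(p, q):
    """Sum of per-type minima divided by sum of per-type maxima."""
    types = p.keys() | q.keys()
    num = sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in types)
    den = sum(max(p.get(t, 0.0), q.get(t, 0.0)) for t in types)
    return num / den if den else 0.0


# Partial distributions: only the types shown on the slide.
DIST_A = {"Book": 0.30, "Movie": 0.20}
DIST_B = {"Book": 0.20, "Movie": 0.30}

similarity = weighted_jaccard(DIST_A, DIST_B)
```

A type present in only one distribution contributes to the denominator but not to the numerator, so it lowers the score.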

slide-49
SLIDE 49

Features – GS: Type dissimilarity

  • Check whether the type distribution of rS contains types that do not exist in rT.
  • For missing types, and based on their ratio, we can accurately assess that rT does not subsume rS.

Example: the type distributions of rT: hasWriter and rS: hasCreator are (Book 30%, Movie 20%, Song 5%, …) and (Book 20%, Movie 30%, Paintings 50%, …).

High dissimilarity!!

slide-50
SLIDE 50

Features – GS: Relevance likelihood

  • Likelihood of ILP scores: the behaviour of the matchers varies depending on the datasets!
  • Compute the likelihood of specific ILP scores being indicators of subsumption for a relation pair:
      • pca likelihood
      • cwa likelihood
      • Joint pca & cwa likelihood
  • Compute the likelihood of a relation alignment being correct given a specific ILP score.
  • Probabilities are measured on the training set and used to assign the scores on the test set.

slide-51
SLIDE 51

Approach: Efficiency Issues

  • Challenges
      • Bandwidth
      • Time-outs at SPARQL endpoints
  • Approach
      • Reduce data transfers
      • Retrieve a subset of instances for a given relation
  • Solution
      • Sample a minimal subset of instances for the relation alignment
          • First-N
          • Random
          • Stratified
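The three sampling strategies can be sketched on an in-memory list of instances. The slides do not say how strata are defined, so the `key` function that assigns an instance to a stratum is an assumption, as are the function names and the fixed seed.

```python
import random
from collections import defaultdict


def first_n(instances, n):
    """Take the first n instances in the order they were returned."""
    return instances[:n]


def random_sample(instances, n, seed=0):
    """Draw up to n instances uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(instances, min(n, len(instances)))


def stratified_sample(instances, n, key, seed=0):
    """Draw up to n instances, taking turns across the groups defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst)
    for members in groups.values():
        rng.shuffle(members)
    sample = []
    while len(sample) < n and any(groups.values()):
        for members in groups.values():
            if members and len(sample) < n:
                sample.append(members.pop())
    return sample


# Made-up instances: (entity, stratum) pairs with an unbalanced split.
INSTANCES = [(f"e{i}", "A" if i < 8 else "B") for i in range(10)]
stratified = stratified_sample(INSTANCES, 4, key=lambda inst: inst[1])
```

With this round-robin scheme a small stratum is represented even when a first-N or purely random draw would likely miss it.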

slide-52
SLIDE 52

Experimental Setup

  • 3 knowledge bases
      • YAGO, DBpedia, Freebase (e.g. YAGO → DBpedia)
  • Relations

    KB           YAGO   DBpedia   Freebase
    #relations   36     563       1666

  • Baselines
      • cwa (used in PARIS)
      • pca (used in ROSA)
  • SOFYA: logistic regression (any other supervised model can be applied)

slide-53
SLIDE 53

Evaluation Results: Performance

  • Full data: comparison of the different models and competitors

slide-54
SLIDE 54

Evaluation Results: Performance

  • Sampled data: individual results on sampling – Stratified Level 3 – 50 entity samples

slide-55
SLIDE 55

Evaluation Results: Efficiency

[Figure: SPARQL sampling time in milliseconds (y-axis ticks 500 to 3500) against sample size (100, 500, 1000) for the strategies firstN, random, str.lvl-2, str.lvl-3, str.lvl-4, str.lvl-5, str.lvl-6.]

[Figure: bandwidth usage in kilobytes (y-axis ticks 20 to 160) against sample size (100, 500, 1000) for the strategies firstN, random, str.lvl-2, str.lvl-3, str.lvl-4, str.lvl-5, str.lvl-6.]

slide-56
SLIDE 56

Conclusions - SOFYA

  • We proposed SOFYA, an instance-based relation alignment approach that discovers subsumptions between relations.
  • We proposed supervised machine learning models that combine a set of lightweight features to decide whether a subsumption relationship is correct or incorrect.
  • SOFYA overcomes the main drawbacks of existing schema matching approaches through efficient alignment algorithms.
  • It harnesses the complementarity of LOD sources through relation alignments at query time.

slide-57
SLIDE 57

Future/Ongoing work

  • Automatic discovery of input types in DORIS
  • Investigate additional features in SOFYA
  • Relation alignment for complex relations: 1-n relations in SOFYA
  • Compute subsumption of relations starting from the super-relation in SOFYA

slide-58
SLIDE 58

Publications (1/2)

National conferences:

  • Mapping Web Services to Knowledge Bases, 2015, Bases de Données Avancées (BDA), Maria Koutraki, Dan Vodislav, Nicoleta Preda
  • DORIS: Discovering Ontological Relations in Services, 2015, Bases de Données Avancées (BDA), Maria Koutraki, Dan Vodislav, Nicoleta Preda
  • Uniformly Querying Web Knowledge Bases, 2016, parisDB, Maria Koutraki, Nicoleta Preda, Dan Vodislav

slide-59
SLIDE 59

Publications (2/2)

International conferences:

  • Deriving Intensional Descriptions for Web Services, 2015, International Conference on Information and Knowledge Management (CIKM), Maria Koutraki, Dan Vodislav, Nicoleta Preda
  • DORIS: Discovering Ontological Relations in Services, 2015, International Semantic Web Conference (ISWC), Maria Koutraki, Dan Vodislav, Nicoleta Preda
  • SOFYA: Semantic on-the-fly Relation Alignment, 2016, International Conference on Extending Database Technology (EDBT), Maria Koutraki, Nicoleta Preda, Dan Vodislav

slide-60
SLIDE 60

Thank you all !

Questions ?
