COMA - A system for flexible combination of schema matching approaches

Hong-Hai Do, Erhard Rahm, University of Leipzig, Germany (dbs.uni-leipzig.de)


SLIDE 1

COMA - A system for flexible combination of schema matching approaches

Hong-Hai Do, Erhard Rahm
University of Leipzig, Germany
dbs.uni-leipzig.de

SLIDE 2

Content

Motivation

The COMA approach
  • Comprehensive matcher library
  • Flexible combination scheme
  • Novel reuse-oriented match approach

Evaluation setup and results

Conclusions and future work

SLIDE 3

Motivation

Schema matching: finding semantic correspondences between two schemas

Crucial step in many applications
  • Data integration: mediators, data warehouses
  • E-Business: XML message mapping
  • ...

Currently manual, time-consuming, tedious

Need for approaches to automate the task as much as possible

[Figure: two purchase order schemas. PO1 contains ShipTo (shipToStreet, shipToCity, shipToZip) and Customer (custName, custStreet, custCity, custZip). PO2 contains DeliverTo and BillTo, which share Address (Street, City, Zip).]

Example correspondence: PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City

SLIDE 4

Individual Match Approaches

Schema-based
  • Element level, linguistic: names, descriptions
  • Element level, constraint-based: types, keys
  • Structure level, constraint-based: parents, children, leaves

Instance-based
  • Element level, linguistic: IR (word frequencies, key terms)
  • Element level, constraint-based: value patterns and ranges

Reuse-oriented
  • Element and structure level: dictionaries, thesauri, previous match results

Survey paper: [Rahm, Bernstein - VLDB Journal’01]

SLIDE 5

Combining Match Approaches

Combination of match algorithms

Hybrid: fixed combination, difficult to extend and improve
  • Currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm

Composite: combination of the results of independently executed matchers
  • Currently only for machine learning-based techniques: LSD, GLUE

COMA: framework for flexible COmbination of MAtch algorithms
  • Extensible matcher library
  • Combination scheme with various combination strategies

SLIDE 6

System Architecture

[Figure: COMA architecture. Schemas S1 and S2 enter through schema import. In each match iteration, matchers from the matcher library (Matcher 1, Matcher 2, Matcher 3) are executed and their results stored in a similarity cube. The combination scheme combines the match results (directions S1→S2 and S2→S1) into a mapping. User interaction (UserFeedback) is optional.]

SLIDE 7

Combination Scheme

The matchers applied to S1 and S2 fill a similarity cube, which is processed in four steps:

  • 1. Aggregation of matcher-specific results (similarity cube → similarity matrix): Average, Max, Min, Weighted
  • 2. Match direction: SmallLarge, LargeSmall, Both
  • 3. Selection of match candidates (similarity matrix → match results): MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
  • 4. Computation of combined similarity (match results → combined similarity): Dice, Average

[Figure: match results listed per direction, e.g. S1→S2: (s1, s2, 0.8) and S2→S1: (s2, s1, 0.8); combined similarity for the schema pair, e.g. [S1, S2, 0.7].]
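The selection strategies named in step 3 can be sketched as small functions over a ranked candidate list. This is a minimal illustration, not COMA's implementation: the function names, the inclusive threshold comparison, and the reading of Delta as an absolute tolerance below the best similarity are assumptions, and the `custCity` value is made up for the example.

```python
from typing import Dict, List, Tuple

Candidates = Dict[str, float]  # candidate element -> aggregated similarity


def ranked(cands: Candidates) -> List[Tuple[str, float]]:
    """Sort candidates by descending similarity (ties broken by name)."""
    return sorted(cands.items(), key=lambda kv: (-kv[1], kv[0]))


def max_n(cands: Candidates, n: int = 1) -> List[Tuple[str, float]]:
    """MaxN: keep the n candidates with the highest similarity."""
    return ranked(cands)[:n]


def threshold(cands: Candidates, t: float) -> List[Tuple[str, float]]:
    """Threshold: keep candidates with similarity of at least t (assumed inclusive)."""
    return [(e, s) for e, s in ranked(cands) if s >= t]


def max_delta(cands: Candidates, delta: float) -> List[Tuple[str, float]]:
    """MaxDelta: keep the best candidate and those within delta of it.

    Assumption: delta is an absolute tolerance below the maximum.
    """
    if not cands:
        return []
    best = max(cands.values())
    return [(e, s) for e, s in ranked(cands) if s >= best - delta]


def threshold_then(cands: Candidates, t: float, select, **kwargs):
    """Threshold+MaxN / Threshold+MaxDelta: filter by t, then apply select."""
    kept = dict(threshold(cands, t))
    return select(kept, **kwargs)


# Hypothetical candidates for one S2 element (custCity's value is invented).
candidates = {"shipToCity": 0.7, "shipToStreet": 0.6, "custCity": 0.3}
top1 = max_n(candidates, 1)
above = threshold(candidates, 0.5)
near_best = threshold_then(candidates, 0.5, max_delta, delta=0.02)
```

Combining Threshold with MaxDelta keeps only candidates that are both good in absolute terms and close to the best one, which is why it tends to return few but reliable candidates.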

SLIDE 8

Match Processing: Example

Input: S1 elements shipToCity and shipToStreet, S2 element City, with |S1| > |S2|

  • 1. Aggregation (Average)
      shipToCity ↔ City: Matcher1 0.6, Matcher2 0.8, Average 0.7
      shipToStreet ↔ City: Matcher1 0.8, Matcher2 0.4, Average 0.6
  • 2. Direction: LargeSmall (match candidates for the smaller schema S2)
      City: shipToCity (0.7), shipToStreet (0.6)
  • 3. Selection
      Threshold(0.5): shipToCity ↔ City (0.7), shipToStreet ↔ City (0.6)
      Max1: shipToCity ↔ City (0.7)
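The three steps of this example can be traced in a few lines. The sketch below uses the similarity values from the slide; the data layout (a dictionary per matcher) and all variable names are assumptions made for illustration.

```python
from statistics import mean

# Similarity cube: matcher -> (S1 element, S2 element) -> similarity.
# The values are the ones used in the example on this slide.
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}

# 1. Aggregation (Average): collapse the matcher dimension into a matrix.
#    Rounding to two decimals only hides floating-point noise.
pairs = sorted({pair for sims in cube.values() for pair in sims})
matrix = {p: round(mean(sims[p] for sims in cube.values()), 2) for p in pairs}

# 2. Direction (LargeSmall): rank the S1 candidates for each element of
#    the smaller schema S2.
by_s2 = {}
for (s1_elem, s2_elem), sim in matrix.items():
    by_s2.setdefault(s2_elem, []).append((s1_elem, sim))
for cands in by_s2.values():
    cands.sort(key=lambda c: -c[1])

# 3. Selection: two alternative strategies applied to the same ranking.
threshold_05 = {e: [c for c in cs if c[1] >= 0.5] for e, cs in by_s2.items()}
max1 = {e: cs[:1] for e, cs in by_s2.items()}
```

With these inputs, Threshold(0.5) keeps both candidates for City, while Max1 keeps only shipToCity.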

SLIDE 9

Matcher Library

Matcher Type | Matcher | Schema Info | Constituent Matchers | Auxiliary Info
Simple | Affix | Element names | - | -
Simple | n-gram | Element names | - | -
Simple | Soundex | Element names | - | -
Simple | EditDistance | Element names | - | -
Simple | Synonym | Element names | - | External dictionaries
Simple | DataType | Data types | - | Data type compatibility table
Simple | UserFeedback | - | - | User-specified (mis-)matches
Hybrid | Name | Element names | Affix, 3-Gram, Synonym | -
Hybrid | NamePath | Names+Paths | Name | -
Hybrid | TypeName | Data types+Names | DataType, Name | -
Hybrid | Children | Child elements | TypeName | -
Hybrid | Leaves | Leaf elements | TypeName | -
Reuse-oriented | Schema | - | - | Existing schema-level match results
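To make the simple string matchers concrete, here is a minimal sketch of three of them. These are textbook formulations chosen for illustration, not COMA's exact definitions: the Dice coefficient over n-gram sets, the normalization of edit distance by the longer name, and the binary affix test are all assumptions.

```python
def ngrams(name: str, n: int = 3) -> set:
    """Set of character n-grams of a lower-cased name."""
    s = name.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_sim(a: str, b: str, n: int = 3) -> float:
    """n-gram matcher: Dice coefficient over the two n-gram sets (assumed)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the row-by-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_sim(a: str, b: str) -> float:
    """EditDistance matcher: 1 minus distance over the longer length (assumed)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def affix_sim(a: str, b: str) -> float:
    """Affix matcher, simplified: 1.0 if the shorter name is a prefix or suffix."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    short, long_ = sorted((a, b), key=len)
    return 1.0 if long_.startswith(short) or long_.endswith(short) else 0.0


trigram_score = ngram_sim("shipToCity", "City")
affix_score = affix_sim("shipToCity", "City")
```

Each function returns a value in [0, 1], so the results of several matchers can be stored side by side and aggregated later.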

SLIDE 10

Reuse-oriented Matching

The MatchCompose operation: transitivity of element similarity
  • Composition of similarity relationships

Reuse of multiple match correspondences
  • vs. reuse of single element-level correspondences from synonym tables, thesauri

m = MatchCompose(m1, m2)

[Figure: m1 relates firstName and lastName in S1 to FName and LName in S (similarities 0.8 and 0.7); m2 relates FName and LName to Name in S2 (similarities 0.6 and 0.6); the composed result m relates firstName and lastName in S1 to Name in S2 (similarities 0.7 and 0.65).]
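The numbers in the figure are consistent with averaging the two similarities along each path: (0.8 + 0.6) / 2 = 0.7 and (0.7 + 0.6) / 2 = 0.65. A minimal sketch under that assumption follows. The assignment of 0.8 and 0.7 to the individual m1 pairs is one reading of the figure, and keeping the maximum when several intermediate elements connect the same pair is an illustrative choice.

```python
from typing import Dict, Tuple

# (source element, target element) -> similarity
Mapping = Dict[Tuple[str, str], float]


def match_compose(m1: Mapping, m2: Mapping) -> Mapping:
    """Compose m1: S1<->S with m2: S<->S2 into m: S1<->S2.

    Assumption: a composed pair gets the average of the two similarities
    along the path. If several intermediate elements connect the same
    pair, the highest composed value is kept (also an assumption).
    """
    composed: Mapping = {}
    for (a, mid1), sim1 in m1.items():
        for (mid2, b), sim2 in m2.items():
            if mid1 != mid2:
                continue  # the two correspondences do not share an element
            sim = (sim1 + sim2) / 2
            composed[(a, b)] = max(sim, composed.get((a, b), 0.0))
    return composed


m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.7}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.6}
m = match_compose(m1, m2)
```

Averaging lowers the composed similarity relative to the stronger of the two input correspondences, which dampens the effect of chaining.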

SLIDE 11

Schema-level Reuse

The Schema matcher:
  • Reuse complete match results at the schema level
  • Exploit all possible reuse opportunities
  • Limit negative effects of transitivity

[Figure: for a match problem S1 ↔ S2, the repository of existing match results is searched for pairs of results that share a schema (S1 ↔ Si and S2 ↔ Si; S1 ↔ Sj and Sj ↔ S2; Sk ↔ S1 and S2 ↔ Sk). Each pair is combined with MatchCompose into a result S1 ↔ S2, the composed results are stored in a similarity cube, and Aggregation, Direction and Selection yield the final match result.]
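The repository search amounts to finding every schema that has a stored match result with both S1 and S2, regardless of the direction in which each result was stored. The sketch below is a hypothetical illustration: the repository layout, the function name `reuse_pivots`, and the schema names PO3 and PO4 are all invented.

```python
from typing import Dict, FrozenSet, List

# Hypothetical repository: unordered schema pair -> stored match result id.
# Using frozenset keys makes S1<->Si and Si<->S1 the same entry.
Repository = Dict[FrozenSet[str], str]


def reuse_pivots(repo: Repository, s1: str, s2: str) -> List[str]:
    """Schemas Si for which both S1<->Si and Si<->S2 results are stored."""
    schemas = set().union(*repo.keys()) if repo else set()
    return sorted(
        si for si in schemas - {s1, s2}
        if frozenset((s1, si)) in repo and frozenset((si, s2)) in repo
    )


repo = {
    frozenset(("PO1", "PO3")): "m13",
    frozenset(("PO3", "PO2")): "m32",
    frozenset(("PO1", "PO4")): "m14",  # PO4 has no result with PO2
}
pivots = reuse_pivots(repo, "PO1", "PO2")
```

Each pivot found this way contributes one composed result, so several pivots fill several layers of the similarity cube that the usual aggregation step then combines.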

SLIDE 12

Real-world Evaluation

5 real-world schemas (XML purchase orders), 10 match tasks
  • CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
  • 40-145 elements

Systematic evaluation (automatic mode)
  • 1 series = 10 experiments: test of 1 configuration of (Matcher, Aggregation, Direction, Selection, Combined similarity) with 10 match tasks
  • 12,312 series = 123,120 experiments

Tested configurations:

Matchers (Σ = 16 + 14)
  • No reuse: 5 single, 11 combinations
  • Reuse: 2 single, 12 combinations

Aggregation (3)
  • No reuse: Max, Average, Min
  • Reuse: Average, Max, Average, Min

Direction (3)
  • LargeSmall
  • SmallLarge
  • Both

Selection (36)
  • MaxN(1-4)
  • Delta(0.01-0.1)
  • Threshold(0.3-1.0)
  • Threshold(0.5)+MaxN(1-4)
  • Threshold(0.5)+Delta(0.01-0.1)

Combined similarity (2)
  • Average
  • Dice
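The counts on this slide can be checked with a short enumeration. The step sizes (integer N, Delta in steps of 0.01, Threshold in steps of 0.1) are assumptions, chosen because they reproduce the stated total of 36 selection strategies. The no-reuse count further assumes that a single matcher needs no aggregation choice while each combination is run with all three; under that assumption the result equals the 8208 series quoted for the combination-strategy results.

```python
# Selection strategies; the step sizes are assumptions that reproduce
# the slide's total of 36.
max_n = [f"MaxN({n})" for n in range(1, 5)]                  # 4
delta = [f"Delta({d / 100:.2f})" for d in range(1, 11)]      # 10
thresh = [f"Threshold({t / 10:.1f})" for t in range(3, 11)]  # 8
thr_max_n = [f"Threshold(0.5)+{s}" for s in max_n]           # 4
thr_delta = [f"Threshold(0.5)+{s}" for s in delta]           # 10
selections = max_n + delta + thresh + thr_max_n + thr_delta

directions = ["LargeSmall", "SmallLarge", "Both"]
combined_sims = ["Average", "Dice"]
aggregations = ["Max", "Average", "Min"]

# Configurations per (matcher, aggregation) choice.
per_matcher_config = len(directions) * len(selections) * len(combined_sims)

# Assumption: 5 single matchers without an aggregation choice, plus
# 11 combinations run with each of the 3 aggregation strategies.
no_reuse_series = (5 + 11 * len(aggregations)) * per_matcher_config

# Each series runs the same configuration on all 10 match tasks.
total_series = 12_312
experiments = total_series * 10
```

The same arithmetic gives 11 × 216 = 2376 series per aggregation strategy among the no-reuse combinations.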

SLIDE 13

Match Quality Measures

Comparison of automatically derived with manually derived (i.e. real) match correspondences

[Figure: overlapping sets of real matches and suggested matches. A: missed matches, B: correct matches, C: false matches.]

Quality measures:

$$\mathit{Precision} = \frac{|B|}{|B| + |C|} \qquad \mathit{Recall} = \frac{|B|}{|A| + |B|}$$

$$\mathit{Overall} = 1 - \frac{|A| + |C|}{|A| + |B|} = \frac{|B| - |C|}{|A| + |B|} = \mathit{Recall} \cdot \left(2 - \frac{1}{\mathit{Precision}}\right)$$

Overall (SimilarityFlooding [ICDE02]): post-match effort to add missed and to remove false matches; negative Overall → no gain

Computed for single experiments and averaged over the 10 experiments of each series (average Overall, etc.)
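The three measures follow directly from the sets A, B and C. The sketch below computes them from a set of real and a set of suggested correspondences; the example correspondences are made up, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
from typing import Dict, Set, Tuple

Correspondence = Tuple[str, str]


def match_quality(real: Set[Correspondence],
                  suggested: Set[Correspondence]) -> Dict[str, float]:
    """Precision, Recall and Overall from real and suggested matches."""
    a = len(real - suggested)   # A: missed matches
    b = len(real & suggested)   # B: correct matches
    c = len(suggested - real)   # C: false matches
    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = b / (b + c) if (b + c) else 0.0
    recall = b / (a + b) if (a + b) else 0.0
    overall = (b - c) / (a + b) if (a + b) else 0.0
    return {"precision": precision, "recall": recall, "overall": overall}


# Made-up example: 4 real matches, 3 of them found, plus 1 false match.
real = {("shipToCity", "City"), ("shipToStreet", "Street"),
        ("shipToZip", "Zip"), ("custCity", "City")}
suggested = {("shipToCity", "City"), ("shipToStreet", "Street"),
             ("shipToZip", "Zip"), ("custStreet", "City")}
quality = match_quality(real, suggested)
```

Here A = 1, B = 3 and C = 1, so Precision and Recall are both 0.75 and Overall is (3 - 1) / 4 = 0.5, which agrees with Recall · (2 - 1/Precision). Overall turns negative as soon as the false matches outnumber the correct ones, that is, when Precision drops below 0.5.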

SLIDE 14

Results: Combination Strategies (1)

Most no-reuse series have negative average Overall

“Good” matcher/strategy:
  • Positive average Overall
  • High presence in higher Overall ranges

[Figure: number of series per average Overall range (#All Series = 8208). x-axis: Overall (Min, 0.0 to 0.8); y-axis: #Series. Bar labels: 7077 in the lowest range, then 270, 207, 62, 136, 114, 179, 160, 3.]

Aggregation: Average (compensating)
  • Average is used by all series with average Overall > 0.6

[Figure: Aggregation (2376 series/strategy). Series share (0% to 100%) per average Overall range (Min, 0.0 to 0.8) for Max, Min and Average.]

SLIDE 15

Results: Combination Strategies (2)

Direction: Both (considering both directions)

Selection: Threshold+Delta (above threshold + within tolerance)

Combined similarity: Average (pessimistic)

Matcher: All (combination of all hybrid matchers)

[Figure: Best selection (228 series/strategy). Series share (0% to 100%) per average Overall range (Min, 0.0 to 0.8) for Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02), Thr(0.5)+Delta(0.02).]

[Figure: Computation of combined similarity (4104 series/strategy). Series share per average Overall range for Dice and Average.]

[Figure: Direction (2736 series/strategy). Series share per average Overall range for Both, SmallLarge, LargeSmall.]
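One way to see why Average is called pessimistic is to compare it with Dice on the same match result. The sketch below assumes that Dice counts the matched elements of both schemas, that Average sums the similarities of their selected match candidates, and that both are normalized by the total number of elements. These definitions, the function name and the parameters are assumptions for illustration.

```python
from typing import Dict


def combined_similarity(s1_size: int, s2_size: int,
                        s1_matches: Dict[str, float],
                        s2_matches: Dict[str, float],
                        strategy: str = "Average") -> float:
    """Combined similarity of two schemas from their selected matches.

    s1_matches / s2_matches map each matched element to the similarity
    of its selected match candidate (assumed representation).
    """
    total = s1_size + s2_size
    if strategy == "Dice":
        # Every matched element counts fully, whatever its similarity.
        return (len(s1_matches) + len(s2_matches)) / total
    if strategy == "Average":
        # Matched elements count only with their similarity value.
        return (sum(s1_matches.values()) + sum(s2_matches.values())) / total
    raise ValueError(f"unknown strategy: {strategy}")


# S1 = {shipToCity, shipToStreet}, S2 = {City}; one correspondence selected.
s1_matches = {"shipToCity": 0.7}
s2_matches = {"City": 0.7}
dice = combined_similarity(2, 1, s1_matches, s2_matches, "Dice")
average = combined_similarity(2, 1, s1_matches, s2_matches, "Average")
```

Because every similarity is at most 1, Average can never exceed Dice under these definitions; the two coincide only when all selected candidates have similarity 1.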

SLIDE 16

Results: Single Matchers

Instability of some single (hybrid) matchers (negative Overall) because of shared elements
  • E.g. DeliverTo.Address and BillTo.Address

Considering hierarchical names (NamePath) is more accurate

Schema-level reuse is very effective:
  • Essential improvement over no-reuse hybrid matchers
  • Reusing approved match results is better than reusing automatically derived match results

[Figure: a) Single matchers. Average Precision, average Recall and average Overall (y-axis from -0.3 to 1) for the no-reuse matchers NamePath, TypeName, Leaves, Children, Name and the reuse matchers SchemaM, SchemaA.]

SchemaM: Schema with manually derived (real) match results
SchemaA: Schema with match results automatically derived using the default match operation
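The shared-element problem can be illustrated by enumerating paths in a schema graph. In the sketch below, the PO2 structure follows the motivating example (DeliverTo and BillTo sharing Address); the dictionary representation and the `paths` helper are assumptions for illustration.

```python
from typing import Dict, List

# Schema as a directed acyclic graph: node -> children.
# Address is shared by DeliverTo and BillTo.
po2: Dict[str, List[str]] = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City", "Zip"],
}


def paths(schema: Dict[str, List[str]], root: str) -> List[str]:
    """Enumerate every root-to-node path; a shared node appears once per context."""
    result: List[str] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        result.append(".".join(path))
        for child in schema.get(node, []):
            walk(child, path)

    walk(root, [])
    return result


all_paths = paths(po2, "PO2")
city_paths = [p for p in all_paths if p.endswith(".City")]
```

A matcher that looks only at the element name sees a single City, whereas a path-based matcher sees two distinct elements, `PO2.DeliverTo.Address.City` and `PO2.BillTo.Address.City`, and can match each to a different counterpart.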

SLIDE 17

Results: Combined Match Approaches

Reuse matchers outperform no-reuse matchers
  • Best no-reuse, All: 0.73 average Overall (Precision 0.95, Recall 0.78)
  • Best reuse, All+SchemaM: 0.82 average Overall (Precision 0.93, Recall 0.89)

Combinations outperform single hybrid matchers
  • Combined matchers, e.g. All, consider many aspects at the same time
  • NamePath+Leaves: effective scheme, considering paths to identify the context of shared elements, and leaves to cope with structural conflicts

[Figure: b) Matcher combinations. Average Precision, average Recall and average Overall (y-axis from 0.1 to 1) for All+SchemaM, SchemaM+NamePath, SchemaM+Name, SchemaM+TypeName, SchemaM+Leaves, SchemaM+Children, All, NamePath+Leaves, NamePath+TypeName, NamePath+Children, NamePath+Name.]

All: combination of all no-reuse hybrid matchers
All+SchemaM: combination of all no-reuse hybrid matchers and SchemaM

SLIDE 18

Results: Match Sensitivity

Impact of schema characteristics:
  • Degrading match quality with increase of schema size

Best combinations: no-reuse All and reuse-oriented All+Schema
  • High stability across different match tasks
  • Little tuning effort for the default match operation

[Figure: Overall(No Reuse) and Overall(Manual Reuse) per match task (1<->2, 1<->3, 2<->3, 1<->4, 2<->4, 3<->4, 1<->5, 2<->5, 3<->5, 4<->5), left axis: Overall (0.1 to 1); #All Elements per match task, right axis: # Elements (24 to 240).]

SLIDE 19

Conclusions and Future Work

The COMA framework
  • Extensible matcher library, including novel reuse approach
  • Powerful combination scheme for both specifying match operations and constructing new matchers from existing ones

Comprehensive evaluation on real-world schemas
  • High effectiveness on large schemas
  • Reuse: essential improvement over no-reuse
  • Composite approach as THE solution for matcher combination

Future work
  • Matchers: more powerful reuse strategies, instance-based matchers
  • More intelligent combination strategies
  • Application to more real-world scenarios, esp. in bioinformatics