COMA: A system for flexible combination of schema matching approaches
Hong-Hai Do, Erhard Rahm
University of Leipzig, Germany
dbs.uni-leipzig.de
Content
Motivation
The COMA approach
- Comprehensive matcher library
- Flexible combination scheme
- Novel reuse-oriented match approach
Evaluation setup and results
Conclusions and future work
Motivation
Schema matching: finding semantic correspondences between two schemas
Crucial step in many applications
- Data integration: mediators, data warehouses
- E-Business: XML message mapping
- ...
Currently manual, time-consuming, tedious
Need for approaches to automate the task as much as possible
[Figure: example purchase order schemas. PO1: ShipTo (shipToStreet, shipToCity, shipToZip), Customer (custName, custStreet, custCity, custZip). PO2: DeliverTo and BillTo, sharing Address (Street, City, Zip).]
Example correspondence: PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City
Individual Match Approaches
Classification following the survey paper [Rahm, Bernstein, VLDB Journal’01]:
Schema-based
- Element level
  - Linguistic: names, descriptions
  - Constraint-based: types, keys
- Structure level
  - Constraint-based: parents, children, leaves
Instance-based
- Element level
  - Linguistic: IR (word frequencies, key terms)
  - Constraint-based: value patterns and ranges
Reuse-oriented
- Element and structure level: dictionaries, thesauri, previous match results
Combining Match Approaches
Combination of match algorithms
- Hybrid: fixed combination, difficult to extend and improve
  - Currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm
- Composite: combination of the results of independently executed matchers
  - Currently only for machine learning-based techniques: LSD, GLUE
COMA: framework for flexible COmbination of MAtch algorithms
- Extensible matcher library
- Combination scheme with various combination strategies
System Architecture
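[Figure: system architecture. Schemas S1 and S2 enter through schema import. Each match iteration consists of matcher execution (Matcher 1, Matcher 2, Matcher 3 from the matcher library, plus UserFeedback), which fills a similarity cube, followed by combination of match results according to the combination scheme (directions S1→S2 and S2→S1), which produces the mapping. User interaction between iterations is optional.]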
Combination Scheme
[Figure: matchers applied to S1 and S2 fill a similarity cube; aggregation reduces it to a similarity matrix; direction and selection yield match results (S1→S2 and S2→S1, e.g. s1 ↔ s2 with 0.8); the last step computes a combined similarity, e.g. [S1, S2, 0.7].]
1. Aggregation of matcher-specific results: Average, Max, Min, Weighted
2. Match direction: SmallLarge, LargeSmall, Both
3. Selection of match candidates: MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
4. Computation of combined similarity: Dice, Average
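The slide names Dice and Average as the two options for step 4 but does not define them. The sketch below assumes one plausible reading: Dice is the share of elements that received at least one match candidate, and Average sums the similarities of the matched elements over the same denominator. The function name and argument layout are hypothetical, not COMA's API.

```python
def combined_similarity(matches_1to2, matches_2to1, size1, size2, strategy="average"):
    """Collapse the selected match candidates of two element sets into one value.

    matches_1to2: {S1 element: similarity of its selected S2 candidate}
    matches_2to1: {S2 element: similarity of its selected S1 candidate}
    size1, size2: total number of elements in S1 and S2
    """
    total = size1 + size2
    if total == 0:
        return 0.0
    if strategy == "dice":
        # Assumed definition: fraction of elements that have a match candidate.
        return (len(matches_1to2) + len(matches_2to1)) / total
    if strategy == "average":
        # Assumed definition: unmatched elements contribute 0 to the sum.
        return (sum(matches_1to2.values()) + sum(matches_2to1.values())) / total
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative input: S1 has 2 elements, S2 has 1, one correspondence of 0.7.
s1_matches = {"shipToCity": 0.7}
s2_matches = {"City": 0.7}
dice = combined_similarity(s1_matches, s2_matches, 2, 1, "dice")
average = combined_similarity(s1_matches, s2_matches, 2, 1, "average")
```

Under these assumed definitions Average can never exceed Dice, because every similarity is at most 1. That fits the deck's later description of Average as the pessimistic option.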
Match Processing: Example
S1 elements: shipToCity, shipToStreet; S2 element: City (|S1| > |S2|)
1. Aggregation (Average)
- shipToCity ↔ City: Matcher1 0.6, Matcher2 0.8, Average 0.7
- shipToStreet ↔ City: Matcher1 0.8, Matcher2 0.4, Average 0.6
2. Direction: LargeSmall (match candidates for the smaller schema S2)
- City: shipToCity (0.7), shipToStreet (0.6)
3. Selection: Threshold(0.5), Max1
- City ↔ shipToCity (0.7)
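The three steps of this example can be replayed in a few lines. This is a minimal sketch using the similarity values from the slide. The function names, the dictionary layout of the similarity cube, and the strict "greater than" reading of Threshold are assumptions. The direction step only covers LargeSmall and assumes S1 is the larger schema.

```python
from statistics import mean

# Similarity cube: matcher -> (S1 element, S2 element) -> similarity
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}


def aggregate(cube, combine=mean):
    """Step 1: reduce the cube to a matrix with one value per element pair."""
    pairs = set().union(*(result.keys() for result in cube.values()))
    return {p: combine([r[p] for r in cube.values() if p in r]) for p in pairs}


def rank_large_small(matrix):
    """Step 2 (LargeSmall): rank S1 candidates for each element of the smaller S2."""
    ranked = {}
    for (e1, e2), sim in matrix.items():
        ranked.setdefault(e2, []).append((sim, e1))
    return {e2: sorted(cands, reverse=True) for e2, cands in ranked.items()}


def select(ranked, max_n=None, threshold=None):
    """Step 3: keep candidates above the threshold, then at most max_n of them."""
    selected = {}
    for target, cands in ranked.items():
        kept = [(s, c) for s, c in cands if threshold is None or s > threshold]
        selected[target] = kept if max_n is None else kept[:max_n]
    return selected


matrix = aggregate(cube)
ranked = rank_large_small(matrix)
result = select(ranked, max_n=1, threshold=0.5)
```

Passing `combine=max` or `combine=min` swaps the aggregation strategy without touching the other steps. This mirrors the slide's point that each step can be configured on its own.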
Matcher Library
| Matcher type | Matcher | Schema info | Auxiliary info | Constituent matchers |
|---|---|---|---|---|
| Simple | Affix | Element names | – | – |
| Simple | n-gram | Element names | – | – |
| Simple | Soundex | Element names | – | – |
| Simple | EditDistance | Element names | – | – |
| Simple | Synonym | Element names | External dictionaries | – |
| Simple | DataType | Data types | Data type compatibility table | – |
| Simple | UserFeedback | – | User-specified (mis-)matches | – |
| Hybrid | Name | Element names | – | Affix, 3-Gram, Synonym |
| Hybrid | NamePath | Names+Paths | – | Name |
| Hybrid | TypeName | Data types+Names | – | DataType, Name |
| Hybrid | Children | Child elements | – | TypeName |
| Hybrid | Leaves | Leaf elements | – | TypeName |
| Reuse-oriented | Schema | – | Existing schema-level match results | – |
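To make the simple name matchers concrete, here is a minimal sketch of two of them: an n-gram matcher and an EditDistance matcher. The slides only name these matchers. The normalisation used here (lower-casing, Dice coefficient over trigram sets, Levenshtein distance scaled by the longer name) is an assumption, not COMA's documented behaviour.

```python
def ngrams(name, n=3):
    """Set of character n-grams of a lower-cased name (the whole name if shorter than n)."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance turned into a similarity in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

Each function returns a value between 0 and 1 for a pair of element names, which is the shape of result the combination scheme needs from every matcher.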
Reuse-oriented Matching
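The MatchCompose operation: transitivity of element similarity
- Composition of similarity relationships
- Reuse of multiple match correspondences, vs. reuse of single element-level correspondences from synonym tables, thesauri
m = MatchCompose(m1, m2)
[Figure: match results m1 and m2, which share the intermediate schema S, are composed into a match result m between S1 and S2. Element names shown: firstName, lastName, FName, LName, Name. Similarity values shown: 0.8, 0.7, 0.6, 0.6 for the inputs m1 and m2; 0.7, 0.65 for the result m.]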
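The values on this slide are consistent with averaging along each path: (0.8 + 0.6) / 2 = 0.7 and (0.7 + 0.6) / 2 = 0.65. The sketch below implements MatchCompose on that assumption. The placement of element names in S1, S and S2, and the rule of keeping the best value when several paths reach the same pair, are illustrative guesses rather than details taken from the slide.

```python
def match_compose(m1, m2):
    """Compose m1 (S1 <-> S) and m2 (S <-> S2) into a match result S1 <-> S2.

    m1: {(s1_element, s_element): similarity}
    m2: {(s_element, s2_element): similarity}
    """
    composed = {}
    for (a, mid1), sim1 in m1.items():
        for (mid2, b), sim2 in m2.items():
            if mid1 != mid2:
                continue  # the two correspondences do not share an S element
            sim = (sim1 + sim2) / 2  # assumed: average of the two similarities
            # Assumed: if several paths connect a and b, keep the best one.
            composed[(a, b)] = max(sim, composed.get((a, b), 0.0))
    return composed


# Assumed assignment of the slide's values to element pairs.
m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.7}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.6}
m = match_compose(m1, m2)
```

Averaging pulls a composed similarity below the stronger of its two inputs. That is one simple way of damping long transitive chains.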
Schema-level Reuse
The Schema matcher:
- Reuse complete match results at the schema level
- Exploit all possible reuse opportunities
- Limit negative effects of transitivity
[Figure: for a match problem S1 ↔ S2, the repository of existing match results is searched for reusable pairs (S1 ↔ Si, S2 ↔ Si; S1 ↔ Sj, Sj ↔ S2; Sk ↔ S1, S2 ↔ Sk). Each pair is combined with MatchCompose into an S1 ↔ S2 result. These results form a similarity cube, which passes through aggregation, direction and selection to give the match result.]
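The repository search can be pictured as looking for intermediate schemas that have a stored match result with both S1 and S2, in either direction. The sketch below is a hypothetical illustration of that lookup only. The repository is reduced to a list of schema-name pairs, and the function name is made up.

```python
def reuse_opportunities(repository, s1, s2):
    """Find stored match results that can be reused for the match problem s1 <-> s2.

    repository: iterable of (schema_a, schema_b) pairs that have a stored match
    result; direction is ignored, so (a, b) also serves as (b, a).
    Returns (direct, intermediates): whether s1 <-> s2 itself is stored, and the
    sorted schemas Sx for which both s1 <-> Sx and Sx <-> s2 are stored.
    """
    neighbours = {}
    for a, b in repository:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    linked_to_s1 = neighbours.get(s1, set())
    linked_to_s2 = neighbours.get(s2, set())
    direct = s2 in linked_to_s1
    intermediates = sorted((linked_to_s1 & linked_to_s2) - {s1, s2})
    return direct, intermediates


repository = [("S1", "Si"), ("S2", "Si"), ("S1", "Sj"), ("Sj", "S2"), ("Sk", "S1")]
direct, intermediates = reuse_opportunities(repository, "S1", "S2")
print(direct, intermediates)  # prints: False ['Si', 'Sj']
```

Each intermediate schema found this way gives one pair of match results to feed into MatchCompose. Here Sk is skipped because the repository holds no result between Sk and S2.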
Real-world Evaluation
5 real-world schemas (XML – Purchase order), 10 match tasks
- CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
- 40-145 elements
Systematic evaluation (automatic mode)
- 1 series = 10 experiments: test of 1 configuration of (Matcher, Aggregation, Direction, Selection, Combined similarity) with 10 match tasks
- 12,312 series = 123,120 experiments
Tested alternatives per configuration dimension:
- Matchers (Σ = 16 + 14): no reuse: 5 single, 11 combinations; reuse: 2 single, 12 combinations
- Aggregation (3): Max, Average, Min (one matcher group is listed with Average only)
- Direction (3): LargeSmall, SmallLarge, Both
- Selection (36): MaxN(1-4), Delta(0.01-0.1), Threshold(0.3-1.0), Threshold(0.5)+MaxN(1-4), Threshold(0.5)+Delta(0.01-0.1)
- Combined similarity (2): Average, Dice
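The series counts can be cross-checked with a little arithmetic. The sketch below recounts the no-reuse series under three assumptions: Delta is varied in steps of 0.01, Threshold in steps of 0.1, and single matchers need no aggregation strategy. With these, the count matches the "#All Series = 8208" figure reported for the no-reuse results. The reuse series are left out because the slide does not make their aggregation settings clear enough to recount.

```python
# Number of tested values per selection strategy (step sizes are assumptions).
selection_values = {
    "MaxN(1-4)": 4,
    "Delta(0.01-0.1)": 10,
    "Threshold(0.3-1.0)": 8,
    "Threshold(0.5)+MaxN(1-4)": 4,
    "Threshold(0.5)+Delta(0.01-0.1)": 10,
}
n_selection = sum(selection_values.values())
n_aggregation = 3   # Max, Average, Min
n_direction = 3     # LargeSmall, SmallLarge, Both
n_combined_sim = 2  # Average, Dice

single_matchers = 5
matcher_combinations = 11
# Assumed: a single matcher has nothing to aggregate, a combination tries all 3.
matcher_configs = single_matchers * 1 + matcher_combinations * n_aggregation

no_reuse_series = matcher_configs * n_direction * n_selection * n_combined_sim
print(n_selection, no_reuse_series)  # prints: 36 8208
```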
Match Quality Measures
Comparison of automatically derived with manually derived (i.e. real) match correspondences
A: missed matches, B: correct matches, C: false matches (real matches = A ∪ B, suggested matches = B ∪ C)
Quality measures:
$Precision = \frac{|B|}{|B| + |C|}$
$Recall = \frac{|B|}{|A| + |B|}$
$Overall = 1 - \frac{|A| + |C|}{|A| + |B|} = \frac{|B| - |C|}{|A| + |B|} = Recall \cdot \left(2 - \frac{1}{Precision}\right)$
- Overall (from SimilarityFlooding [ICDE02]): post-match effort to add missed and to remove false matches; negative Overall → no gain
- Computed for single experiments and averaged over 10 experiments for each series (average Overall, etc.)
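The three measures follow directly from the sets A, B and C. This is a minimal sketch with correspondences represented as hashable items. The function name and the choice to return 0.0 for empty denominators are assumptions.

```python
def match_quality(real, suggested):
    """Return (precision, recall, overall) for one experiment.

    real:      manually derived correspondences (A ∪ B)
    suggested: automatically derived correspondences (B ∪ C)
    """
    real, suggested = set(real), set(suggested)
    a = len(real - suggested)  # A: missed matches
    b = len(real & suggested)  # B: correct matches
    c = len(suggested - real)  # C: false matches
    precision = b / (b + c) if b + c else 0.0
    recall = b / (a + b) if a + b else 0.0
    # Overall = 1 - (A + C) / (A + B) = (B - C) / (A + B)
    overall = (b - c) / (a + b) if a + b else 0.0
    return precision, recall, overall


# 10 real matches; the matcher finds 8 of them and adds 1 false match.
real = {f"r{i}" for i in range(10)}
suggested = {f"r{i}" for i in range(8)} | {"false1"}
precision, recall, overall = match_quality(real, suggested)
```

With more false than correct matches Overall turns negative, e.g. `match_quality({"x", "y"}, {"x", "p", "q"})` gives an Overall of -0.5. This is the "no gain" case from the slide: cleaning up the suggestions costs more than matching by hand.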
Results: Combination Strategies (1)
Most no-reuse series have negative average Overall
“Good” matcher/strategy:
- Positive average Overall
- High presence in higher Overall ranges
[Figure: number of no-reuse series per average Overall range, from Min to 0.8 (#All Series = 8208). Bar labels: 7077 for the lowest range (starting at Min), then 270, 207, 62, 136, 114, 179, 160, 3.]
Aggregation: Average (compensating)
- Average is used by all series with average Overall > 0.6
[Figure: Aggregation (2376 series/strategy): share of series per Overall range (Min to 0.8) for Max, Min, Average.]
Results: Combination Strategies (2)
Direction: Both (considering both directions)
Selection: Threshold+Delta (above threshold + within tolerance)
Combined similarity: Average (pessimistic)
Matcher: All (combination of all hybrid matchers)
[Figure: Best selection (228 series/strategy): share of series per Overall range (Min to 0.8) for Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02), Thr(0.5)+Delta(0.02).]
[Figure: Computation of combined similarity (4104 series/strategy): share of series per Overall range (Min to 0.8) for Dice, Average.]
[Figure: Direction (2736 series/strategy): share of series per Overall range (Min to 0.8) for Both, SmallLarge, LargeSmall.]
Results: Single Matchers
Instability of some single (hybrid) matchers (negative Overall) because of shared elements
- E.g. DeliverTo.Address and BillTo.Address
- Considering hierarchical names (NamePath) more accurate
Schema-level reuse very effective:
- Essential improvement over no-reuse hybrid matchers
- Reusing approved match results better than automatically derived match results
[Figure a) Single matchers: avg Precision, avg Recall and avg Overall for the no-reuse matchers NamePath, TypeName, Leaves, Children, Name and the reuse matchers SchemaM, SchemaA.]
SchemaM: Schema with manually derived (real) match results
SchemaA: Schema with match results automatically derived using the default match operation
Results: Combined Match Approaches
Reuse matchers outperform no-reuse matchers
- Best no-reuse All: 0.73 average Overall (Precision 0.95, Recall 0.78)
- Best reuse All+SchemaM: 0.82 average Overall (Precision 0.93, Recall 0.89)
Combinations outperform single hybrid matchers
- Combined matchers, e.g. All, consider many aspects at the same time
- NamePath+Leaves: effective scheme, considering paths to identify context of shared elements, and leaves to cope with structural conflicts
[Figure b) Matcher combinations: avg Precision, avg Recall and avg Overall for All+SchemaM, SchemaM+NamePath, SchemaM+Name, SchemaM+TypeName, SchemaM+Leaves, SchemaM+Children, All, NamePath+Leaves, NamePath+TypeName, NamePath+Children, NamePath+Name.]
All: Combination of all no-reuse hybrid matchers
All+SchemaM: Combination of all no-reuse hybrid matchers and SchemaM
Results: Match Sensitivity
Impact of schema characteristics:
- Degrading match quality with increase of schema size
Best combinations: no-reuse All and reuse-oriented All+Schema
- High stability across different match tasks
- Little tuning effort for the default match operation
[Figure: Overall(No Reuse) and Overall(Manual Reuse) per match task (1<->2, 1<->3, 2<->3, 1<->4, 2<->4, 3<->4, 1<->5, 2<->5, 3<->5, 4<->5), shown together with the number of elements per task (#All Elements).]
Conclusions and Future Work
The COMA framework
- Extensible matcher library, including novel reuse approach
- Powerful combination scheme for both specifying match operations and constructing new matchers from existing ones
Comprehensive evaluation on real-world schemas
- High effectiveness on large schemas
- Reuse: essential improvement over no-reuse
- Composite approach as THE solution for matcher combination
Future work
- Matchers: more powerful reuse strategies, instance-based matchers
- More intelligent combination strategies
- Application to more real-world scenarios, esp. in bioinformatics