SLIDE 1

iTrails: Pay-as-you-go Information Integration in Dataspaces

Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007, September 26, 2007

SLIDE 2

Marcos Vaz Salles / ETH Zurich / marcos.vazsalles@inf.ethz.ch

Outline

  • Motivation
  • iTrails
  • Experiments
  • Conclusions and Future Work

SLIDE 3

Problem: Querying Several Sources

Query: “What is the impact of global warming in Zurich?”

Data Sources: Laptop, Email Server, Web Server, DB Server

Query Systems: ? ? ? ? (one unknown per data source)

SLIDE 4

Solution 1: Use a Search Engine

Query: global warming zurich

Query System: Graph IR Search Engine, e.g. TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]

Data Sources: Laptop, Email Server, Web Server, DB Server (each contributes text and links)

Drawback: Query semantics are not precise!

SLIDE 5

Solution 2: Use an Information Integration System

Query: //Temperatures/*[city = “zurich”]

Query System: Information Integration System, e.g. GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04])

Integrated schema: Temps, Cities, CO2, Sunspots, ...

Data Sources: Laptop, Email Server, Web Server, DB Server (two have a schema mapping, two have a missing schema mapping)

Drawback: Too much effort to provide schema mappings!

SLIDE 6

Research Challenge: Is There an Integration Solution in-between These Two Extremes?

  • Graph IR Search Engine: query global warming zurich; data sources contribute text and links
  • Information Integration System: query //Temperatures/*[city = “zurich”]; integrated schema (Temps, Cities, CO2, Sunspots, ...); data sources need full-blown schema mappings
  • Dataspace System (in between): query global warming zurich; data sources (Laptop, Email Server, Web Server, DB Server) contribute text and links; Pay-as-you-go Information Integration

Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]

SLIDE 7

Outline

  • Motivation
  • iTrails
  • Experiments
  • Conclusions and Future Work

SLIDE 8

iTrails Core Idea: Add Integration Hints Incrementally

Step 1: Provide a search service over all the data

  • Use a general graph data model (see VLDB 2006)
  • Works for unstructured documents, XML, and relations

Step 2: Add integration semantics via hints (trails) on top of the graph

  • Works across data sources, not only between sources

Step 3: If more semantics are needed, go back to step 2

Impact:

  • Smooth transition between search and data integration
  • Semantics added incrementally improve precision / recall

SLIDE 9

iTrails: Defining Trails

Basic Form of a Trail

QL [.CL] → QR [.CR]

Intuition:

When I query for QL [.CL], you should also query for QR [.CR]

  • QL, QR: Queries (NEXI-like keyword and path expressions)
  • CL, CR: Attribute projections
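As a minimal sketch of the basic form above, a trail can be held in a small record with two queries and two optional attribute projections. The class and field names are illustrative assumptions, not part of iTrails or iMeMex.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """A trail QL[.CL] -> QR[.CR]; all names here are illustrative."""
    left_query: str                    # QL: keyword or path expression
    right_query: str                   # QR: keyword or path expression
    left_attr: Optional[str] = None    # CL: optional attribute projection
    right_attr: Optional[str] = None   # CR: optional attribute projection

    def _side(self, query: str, attr: Optional[str]) -> str:
        # Append the projection only when one is given.
        return f"{query}.{attr}" if attr else query

    def __str__(self) -> str:
        left = self._side(self.left_query, self.left_attr)
        right = self._side(self.right_query, self.right_attr)
        return f"{left} -> {right}"


keyword_trail = Trail("car", "auto")
attribute_trail = Trail("//Employee//*", "//Person//*", "tuple.empName", "tuple.name")
print(keyword_trail)    # car -> auto
print(attribute_trail)  # //Employee//*.tuple.empName -> //Person//*.tuple.name
```

Keeping the projections optional mirrors the square brackets in QL [.CL] → QR [.CR].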

SLIDE 10

Trail Examples: Global Warming Zurich

Query: global warming zurich

Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”

global warming → //Temperatures/*[celsius > 10]

Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”

zurich → //*[region = “ZH”]

Temperatures (DB Server)

city     region   celsius   date
Bern     BE       20        24-Sep
Uster    ZH       15        24-Sep
Zurich   ZH       14        25-Sep
Zurich   ZH       9         26-Sep

SLIDE 11

Trail Example: Deep Web Bookmarks

Query: train home

Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”

train home → //trainCompany.com//*[origin=“ETH Uni” and dest =“Seilbahn Rigiblick”]

Data Source: Web Server

SLIDE 12

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search

Trail for Thesauri: “When I query for car, you should also query for auto”

car → auto

Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”

car → carro
carro → car

Data Sources: Laptop, Email Server
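The thesaurus and dictionary trails can be illustrated with a one-step keyword expansion. This is only a sketch: the function name and the pair representation are assumptions, and a real system would rewrite a query tree rather than a bare keyword.

```python
# Keyword trails from the slide, written as (left, right) pairs.
TRAILS = [("car", "auto"), ("car", "carro"), ("carro", "car")]


def expand_keyword(keyword: str, trails=TRAILS) -> set:
    """Return the keyword plus every right side whose left side matches it.

    Only one rewriting step is applied, so the mutually recursive trails
    car -> carro and carro -> car cannot loop here.
    """
    expanded = {keyword}
    for left, right in trails:
        if left == keyword:
            expanded.add(right)
    return expanded


print(sorted(expand_keyword("car")))    # ['auto', 'car', 'carro']
print(sorted(expand_keyword("carro")))  # ['car', 'carro']
print(sorted(expand_keyword("auto")))   # ['auto']
```

Note that auto does not expand to car, because the thesaurus trail is given in one direction only.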

SLIDE 13

Trail Examples: Schema Equivalences

Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”

//Employee//*.tuple.empName → //Person//*.tuple.name

Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”

//Employee//*.tuple.salary → //Person//*.tuple.income

Relations (DB Server):

  • Employee(empName, empId, salary)
  • Person(name, age, SSN, income)

SLIDE 14

Outline

  • Motivation
  • iTrails: Core Idea; Trail Examples; How are Trails Created?; Uncertainty and Trails; Rewriting Queries with Trails; Recursive Matches
  • Experiments
  • Conclusion and Future Work

SLIDE 15

How are Trails Created?

Given by the user

  • Explicitly
  • Via Relevance Feedback

(Semi-)Automatically

  • Information extraction techniques
  • Automatic schema matching
  • Ontologies and thesauri (e.g., WordNet)
  • User communities (e.g., trails on gene data, bookmarks)

SLIDE 16

Uncertainty and Trails

Probabilistic Trails:

  • model uncertain trails
  • probabilities used to rank trails

QL [.CL] → QR [.CR], 0 ≤ p ≤ 1

Example: car → auto, p = 0.8

SLIDE 17

Certainty and Trails

Scored Trails:

  • give higher value to certain trails
  • scoring factors used to boost scores of query results obtained by the trail

QL [.CL] → QR [.CR], sf > 1

Examples:

  • T1: weather → //Temperatures/*
  • T2: yesterday → //*[date = today() – 1]

Annotated values: (p = 1, sf = 3) and (p = 0.9, sf = 2)
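A rough sketch of how a probability p and a scoring factor sf might be attached to a trail follows. The multiplicative boost rule, the names, and the pairing of values with the date trail are assumptions made for illustration; the slides only state that probabilities rank trails and that scoring factors boost result scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTrail:
    left: str
    right: str
    p: float = 1.0    # probability that the trail is correct, 0 <= p <= 1
    sf: float = 1.0   # scoring factor for results obtained via the trail

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        if self.sf < 1.0:
            raise ValueError("sf is expected to be at least 1")


def rank_trails(trails):
    """Order trails by probability, most certain first."""
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boosted_score(base_score: float, trail: ScoredTrail) -> float:
    """Assumed boost rule: multiply the result's base score by sf."""
    return base_score * trail.sf


# Hypothetical pairing of values with trails, for illustration only.
thesaurus = ScoredTrail("car", "auto", p=0.8)
date_trail = ScoredTrail("yesterday", "//*[date = today() - 1]", p=1.0, sf=3.0)

print([t.left for t in rank_trails([thesaurus, date_trail])])  # ['yesterday', 'car']
print(boosted_score(0.5, date_trail))  # 1.5
```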

SLIDE 18

Rewriting Queries with Trails

Query: weather yesterday

Trail T2: yesterday → //*[date = today() – 1]

(1) Matching: T2 matches the keyword yesterday in the query

(2) Transformation: the right side of T2, //*[date = today() – 1], is generated

(3) Merging: yesterday is replaced in the query by the union yesterday U //*[date = today() – 1]

Rewritten query: weather (yesterday U //*[date = today() – 1])
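The three rewriting steps (matching, transformation, merging) can be sketched on a simplified query representation. Here a query is assumed to be a list of terms, each a tuple of unioned alternatives. This flat layout and the function name are illustrative assumptions and not the iMeMex query algebra.

```python
def rewrite(query_terms, trail):
    """Apply one trail (left, right) to a query given as a list of terms.

    Each term is a tuple of alternatives that are unioned together;
    the terms themselves stay combined as in the original query.
    """
    left, right = trail
    rewritten = []
    for alternatives in query_terms:
        if left in alternatives:                          # (1) matching
            transformed = right                           # (2) transformation
            alternatives = alternatives + (transformed,)  # (3) merging by union
        rewritten.append(alternatives)
    return rewritten


T2 = ("yesterday", "//*[date = today() - 1]")
query = [("weather",), ("yesterday",)]
print(rewrite(query, T2))
# [('weather',), ('yesterday', '//*[date = today() - 1]')]
```

Calling `rewrite` a second time on the result would add the same path expression again, because the rewritten query still contains yesterday.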

SLIDE 19

Replacing Trails

Trails that use replace instead of union semantics

Query: weather yesterday

Trail T2 (replacing): yesterday → //*[date = today() – 1]

(1) Matching: T2 matches the keyword yesterday in the query

(2) Transformation: the right side of T2, //*[date = today() – 1], is generated

(3) Merging: yesterday is replaced by the transformation instead of being unioned with it

Rewritten query: weather //*[date = today() – 1]

SLIDE 20

Problem: Recursive Matches (1/2)

Trail T2: yesterday → //*[date = today() – 1]

After T2 matches once, the query weather yesterday becomes:

weather (yesterday U //*[date = today() – 1])

New query still matches T2, so T2 could be applied again:

weather (yesterday U //*[date = today() – 1] U //*[date = today() – 1])

T2 matches again, and so on: Infinite recursion

SLIDE 21

Problem: Recursive Matches (2/2)

Trails may be mutually recursive

  • T3: //*.tuple.date → //*.tuple.modified
  • T10: //*.tuple.modified → //*.tuple.date

Starting from weather (yesterday U //*[date = today() – 1]):

  • T3 matches the date predicate and adds //*[modified = today() – 1]
  • T10 matches the modified predicate and adds //*[date = today() – 1]

We again match T3 and enter an infinite loop

SLIDE 22

Solution: Multiple Match Coloring Algorithm

Query: weather yesterday

Trails:

  • T1: weather → //Temperatures/*
  • T2: yesterday → //*[date = today() – 1]
  • T3: //*.tuple.date → //*.tuple.modified
  • T4: //*.tuple.date → //*.tuple.received

First Level: T1 matches weather and T2 matches yesterday

(weather U //Temperatures/*) (yesterday U //*[date = today() – 1])

Second Level: T3 and T4 match the date predicate

(weather U //Temperatures/*) (yesterday U //*[date = today() – 1] U //*[modified = today() – 1] U //*[received = today() – 1])
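The slide does not spell out the algorithm, so the following is a simplified stand-in inspired by the coloring idea and should not be read as the MMCA itself. Each derived expression remembers (is "colored" with) the trails used to derive it, a trail is never re-applied along its own derivation, and duplicate expressions are dropped. The `Trail` tuple, the string-based attribute matching, and T10 (borrowed from the mutual-recursion example) are assumptions for illustration.

```python
from typing import NamedTuple


class Trail(NamedTuple):
    name: str
    left: str
    right: str
    is_attribute: bool = False   # attribute trails rewrite predicates like [date = ...]

    def apply(self, expr: str):
        """Return the transformed expression, or None if the trail does not match."""
        if self.is_attribute:
            token = f"[{self.left} "
            return expr.replace(token, f"[{self.right} ") if token in expr else None
        return self.right if expr == self.left else None


def expand(keywords, trails, max_levels=10):
    """Level-wise expansion; each expression carries the trails used to derive it."""
    groups = [[(kw, frozenset())] for kw in keywords]   # one union group per keyword
    for _ in range(max_levels):
        changed = False
        for group in groups:
            known = {expr for expr, _ in group}
            additions = []
            for expr, colors in group:
                for trail in trails:
                    if trail.name in colors:   # already used on this derivation
                        continue
                    new_expr = trail.apply(expr)
                    if new_expr is not None and new_expr not in known:
                        known.add(new_expr)
                        additions.append((new_expr, colors | {trail.name}))
            if additions:
                group.extend(additions)
                changed = True
        if not changed:
            break
    return [[expr for expr, _ in group] for group in groups]


TRAILS = [
    Trail("T1", "weather", "//Temperatures/*"),
    Trail("T2", "yesterday", "//*[date = today() - 1]"),
    Trail("T3", "date", "modified", True),
    Trail("T4", "date", "received", True),
    Trail("T10", "modified", "date", True),   # mutually recursive with T3
]
for group in expand(["weather", "yesterday"], TRAILS):
    print(group)
```

With these inputs the expansion stops after the second level even though T3 and T10 are mutually recursive, because T10 only regenerates an expression that is already present.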

SLIDE 23

Multiple Match Coloring Algorithm Analysis

Problem: MMCA is exponential in number of levels

Solution: Trail Pruning

  • Prune by number of levels
  • Prune by top-K trails matched in each level
  • Prune by both top-K trails and number of levels
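The three pruning strategies can be sketched over a precomputed list of per-level matches. Ranking matched trails by probability is an assumption, as are the function names and the sample probabilities.

```python
def prune_matches(matched_trails, top_k=None):
    """Keep only the top_k matched trails of one level, ranked by probability.

    matched_trails is a list of (name, probability) pairs; top_k=None keeps all.
    """
    ranked = sorted(matched_trails, key=lambda t: t[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def prune_levels(matches_per_level, max_levels=None, top_k=None):
    """Apply level pruning, top-K pruning, or both to per-level trail matches."""
    levels = matches_per_level if max_levels is None else matches_per_level[:max_levels]
    return [prune_matches(level, top_k) for level in levels]


# Hypothetical matches and probabilities for three rewriting levels.
levels = [[("T1", 0.9), ("T2", 1.0)], [("T3", 0.6), ("T4", 0.7)], [("T10", 0.5)]]

print(prune_levels(levels, max_levels=2))           # prune by number of levels
print(prune_levels(levels, top_k=1))                # prune by top-K per level
print(prune_levels(levels, max_levels=2, top_k=1))  # prune by both
# last line prints [[('T2', 1.0)], [('T4', 0.7)]]
```

Both parameters bound the work per query: `max_levels` caps the depth of rewriting and `top_k` caps the number of trails applied at each level.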

SLIDE 24

Outline

  • Motivation
  • iTrails
  • Experiments
  • Conclusion and Future Work

SLIDE 25

iTrails Evaluation in iMeMex

iMeMex Dataspace System: Open-source prototype available at http://www.imemex.org

Main Questions in Evaluation

  • Quality: Top-K Precision and Recall
  • Performance: Use of Materialization
  • Scalability: Query-rewrite Time vs. Number of Trails

SLIDE 26

iTrails Evaluation in iMeMex

Scenario 1: Few High-quality Trails

  • Closer to information integration use cases
  • Obtained real datasets and indexed them
  • 18 hand-crafted trails
  • 14 hand-crafted queries

Scenario 2: Many Low-quality Trails

  • Closer to search use cases
  • Generated up to 10,000 trails

SLIDE 27

iTrails Evaluation in iMeMex: Scenario 1

Configured iMeMex to act in three modes

  • Baseline: Graph / IR search engine
  • iTrails: Rewrite search queries with trails
  • Perfect Query: Semantics-aware query

Data: shipped to central index

[Table: sizes in MB of the data from Laptop, Email Server, Web Server, DB Server]

SLIDE 28

Quality: Top-K Precision and Recall

Scenario 1: few high-quality trails (18 trails), K = 20

[Figure: top-K precision and recall per query]

  • Search Engine misses relevant results (e.g. Q3: pdf yesterday)
  • Search Query is partially semantics-aware (e.g. Q13: to = raimund.grube@enron.com)
  • Perfect Query always has precision and recall equal to 1
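For reference, top-K precision and recall for a single query can be computed as below. This uses a common textbook definition (precision over the results actually returned in the top K), which is an assumption and may differ in detail from the measure used in the evaluation. The document ids are made up.

```python
def precision_recall_at_k(ranked_results, relevant, k=20):
    """Top-K precision and recall for one query.

    ranked_results: result ids, best first; relevant: set of relevant ids.
    """
    top_k = ranked_results[:k]
    hits = sum(1 for r in top_k if r in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_recall_at_k(ranked, relevant, k=2))  # precision 0.5, recall 1/3
print(precision_recall_at_k(ranked, relevant, k=4))  # precision 0.5, recall 2/3
```

Under this definition a query that returns exactly the relevant results within the top K has precision and recall equal to 1.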

SLIDE 29

Performance: Use of Materialization

Scenario 1: few high-quality trails (18 trails)

[Figure: response times in sec.]

  • Trail merging adds overhead to query execution
  • Trail Materialization provides interactive times for all queries

SLIDE 30

Scalability: Query-rewrite Time vs. Number of Trails

Query-rewrite time can be controlled with pruning

Scenario 2: many low-quality trails

SLIDE 31

Conclusion: Pay-as-you-go Information Integration

Step 1: Provide a search service over all the data

Step 2: Add integration semantics via trails

Step 3: If more semantics are needed, go back to step 2

[Figure: Dataspace System answering global warming zurich over Data Sources that contribute text and links]

Our Contributions

  • iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  • We propose a framework and algorithms for Pay-as-you-go Information Integration
  • Smooth transition between search and data integration

SLIDE 32

Future Work

Trail Creation

  • Use collections (ontologies, thesauri, Wikipedia)
  • Work on automatic mining of trails from the dataspace

Other types of trails

  • Associations
  • Lineage

SLIDE 33

Questions?

Thanks in advance for your feedback! ☺

marcos.vazsalles@inf.ethz.ch
http://www.imemex.org

SLIDE 34

Backup Slides

SLIDE 35

Problem: Global Warming in Zurich

Query: “What is the impact of global warming in Zurich?”

Search for: global warming zurich

Meaning of keyword query:

  • global warming should lead to query on Temperatures
  • zurich should lead to a query for a city

SLIDE 36

Problem: PDF Yesterday

Query: “Retrieve all PDF documents added/modified yesterday”

Search for: pdf yesterday

Meaning of keywords pdf and yesterday

Different sources, different schemas:

  • Laptop: modified
  • Email: received
  • DBMS: changed

SLIDE 37

Related Work: Search vs. Data Integration vs. Dataspaces

Features             | Search             | Dataspaces         | Data Integration
Integration Effort   | Low                | Pay-as-you-go      | High
Query Semantics      | Precision / Recall | Precision / Recall | Precise
Need for Schema      | Schema-never       | Schema-later       | Schema-first

SLIDE 38

Personal Dataspaces Literature

  • Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
  • Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
  • Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
  • Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
  • Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007.
  • Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.

SLIDE 39

iDM: iMeMex Data Model

Our approach: get the data model closer to personal information – not the other way around

Supports:

  • Unstructured, semi-structured and structured data, e.g., files&folders, XML, relations
  • Clear separation of logical and physical representation of data
  • Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  • Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  • Infinite data, e.g., media and data streams

See VLDB 2006

SLIDE 40

Data Model Options

[Table: Support for Personal Data, by data model]

  • Data Models: Bag of Words, Relational, XML, iDM
  • Criteria: Non-schematic data; Serialization independent; Support for Graph data; Support for Lazy Computation; Support for Infinite data
  • Cell notes: Specific schema; Extension: XLink/XPointer; View mechanism; Extension: ActiveXML; Extension: Document streams; Extension: Relational streams; Extension: XML streams

SLIDE 41

Data Models for Personal Information

[Figure: Physical Level, Relational, XML, Document / Bag of Words, Personal Information, and iDM placed along an Abstraction Level axis running from lower to higher]

SLIDE 42

Architectural Perspective of iMeMex

[Figure: layered architecture of the iMeMex PDSMS, placed between an Application Layer and a Data Source Layer]

  • Application Layer: Search & Browse, Office Tools, Email, ...
  • iMeMex PDSMS components: iQL Query Processor, iDM Query Processor, Data Source Query Processor, Data Source Plugins, Content Converters, Operators, Physical Algebra, Catalog, Data Store, Result Cache, Indexes & Replicas, Data Cleaning
  • Access paths: Complex operators (query algebra); Indexes&Replicas access (warehousing); Data source access (mediation)
  • Data Source Layer: File System, IMAP, DBMS, ...