SLIDE 1

XML Retrieval

DB/IR in Theory, Web in Practice

Sihem Amer-Yahia (Yahoo! Research), Mariano Consens (University of Toronto)

In collaboration with: Ricardo Baeza-Yates (Yahoo! Research) and Mounia Lalmas (Queen Mary, Univ. of London). VLDB 2007, Vienna, 26/09/2007

SLIDE 2

Preliminaries

DB focused on languages, expressiveness, and efficient evaluation

IR focused on scoring and relevance metrics

In practice, a limited set of operations and simple ranking go a long way

Theory is scary (think XQuery)

Practice is inspiring but looks ad-hoc


SLIDE 3

Notion of Relevance

Data retrieval:

Syntax expresses semantics

Information retrieval:

Ambiguous semantics
Relevance depends on user and context
There is no “perfect” retrieval system

User assessments to evaluate system effectiveness


SLIDE 4

Overview

Preliminaries
Web in Practice
  Search in Web 2.0
  Microformats and Mashups
DB/IR in Theory

  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges

SLIDE 5

Web 2.0 (from Wikipedia)

Rich Set of Buzzwords


SLIDE 6

(Web) Search is a Basic Necessity

A (grossly inadequate) analogy: Toilets and Web 2.0


"Rich societies have developed quite complicated and expensive systems for removing human wastes from houses and cities, usually by dumping them, treated to one degree or another, into subsoils or bodies of water." (Peter Bane, 2006)


SLIDE 7

Rich Standard Infrastructure

Standard Pipes
XML


SLIDE 8

Big Infrastructure Sites

Water Treatment Plants
Search Engines
Portals


SLIDE 9

Community Sites


SLIDE 10

The Importance of Mobility

The need to carry around technological solutions to basic necessities


SLIDE 11

Most Commonly Used is …


“most popular searches” (2-3 keywords)
Squat toilet

There are simple and sophisticated solutions to basic necessities
Need for more sophisticated search


SLIDE 12

Overview

Preliminaries
Web in Practice
  Search in Web 2.0
  Microformats and Mashups
DB/IR in Theory

  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges

SLIDE 13

Microformats

Community data formats

Personal Data: hCard (vCard)
Calendar and Events: hCal (iCal)
Social Networking: XFN
Reviews: hReview
Licenses: rel-license
Folksonomies: rel-tag

Embedded in XHTML pages and RSS feeds

Also RSS Extensions (iTunes, Yahoo! Media, Geo, Google Base, 20+ more in use)


SLIDE 14

Example: hCal

&lt;strong class="summary"&gt;Fashion Expo&lt;/strong&gt; in
&lt;span class="location"&gt;Paris, France&lt;/span&gt;:
&lt;abbr class="dtstart" title="2006-10-20"&gt;Oct 20&lt;/abbr&gt; to
&lt;abbr class="dtend" title="2006-10-23"&gt;22&lt;/abbr&gt;

Large and growing list of websites

Eventful.com, LinkedIn, Yedda, upcoming.yahoo.com, Yahoo! Local, Yahoo! Tech Reviews

Benefit from shared tools, practices (hCalendar creator, iCal Extraction)
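As a sketch of how such embedded properties are consumed, the following pulls the hCalendar fields out of the fragment above with Python's standard HTML parser. The class names come from the microformat; the extractor itself is illustrative, not a full microformats parser.

```python
from html.parser import HTMLParser

# Minimal sketch: collect hCalendar properties from class attributes
# in an XHTML fragment (not a complete microformats implementation).
class HCalExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = {}          # property name -> value
        self._current = None     # class currently being captured

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        cls = a.get("class")
        if cls in ("summary", "location", "dtstart", "dtend"):
            # <abbr> carries the machine-readable value in @title
            if tag == "abbr" and "title" in a:
                self.props[cls] = a["title"]
                self._current = None
            else:
                self._current = cls

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()
            self._current = None

fragment = ('<strong class="summary">Fashion Expo</strong> in '
            '<span class="location">Paris, France</span>: '
            '<abbr class="dtstart" title="2006-10-20">Oct 20</abbr> to '
            '<abbr class="dtend" title="2006-10-23">22</abbr>')

p = HCalExtractor()
p.feed(fragment)
print(p.props)
```

Note how the human-readable text ("Oct 20") and the machine-readable value ("2006-10-20") coexist in the same markup; tools like the iCal extractors mentioned above rely on exactly this separation.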


SLIDE 15

Semantic Mashups

A “semantic” mashup can combine:
Contact (hCard)
Friends (XFN, FOAF)
To attend a recommended event (hCal, hReview)

Microformats are the lower-case semantic web

Also Machine Tags (e.g., flickr:user=me)

Tags that use a special syntax to define extra information about a tag

Have a namespace, a predicate and a value (sounds familiar?)


SLIDE 16

Search in Mashup Creation


SLIDE 17

Mashup Tools

Microsoft Popfly
IBM ProjectZero
Yahoo! Pipes

Allows developers to mash up web data
Drag-and-drop editor which enables users to connect multiple Internet data sources

A source is grabbed and searched! Both content and structure are queried.


SLIDE 18

Yahoo! Pipes Demo


SLIDE 19

Yahoo! Pipes Demo


SLIDE 20

Yahoo! Pipes Demo Result


SLIDE 21

Overview

Preliminaries
Web in Practice
  Search in Web 2.0
  Microformats and Mashups
DB/IR in Theory

  Retrieval Languages and Semantics
  Evaluation à la DB (Query Processing)

  Evaluation à la DB (Relevance Assessments)
Challenges


SLIDE 22

Take Away

Search is crucial when accessing Web 2.0 sources

There is already demand for exploiting additional structure in Web 2.0 search

Structure (XML) retrieval needs to:

be exposed to users/developers
support rich, context-dependent semantics
address efficiency and effectiveness


SLIDE 23

Overview

Preliminaries
Web in Practice
DB/IR in Theory
  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges


SLIDE 24

Languages

Keyword search

“squat”

Tag + Keyword search

description: squat

Path Expression + Keyword search

//image[./title about “squat”]

XQuery + Complex full-text search

for $i in //image

let score $s := $i ftscore “squat” && “toilet” distance 2
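The "Path Expression + Keyword" level can be approximated with nothing more than the Python standard library. In this sketch, about() is replaced by a plain substring test, which is an assumption for illustration, not the real full-text semantics, and the sample collection is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<collection>
  <image><title>Squat toilet</title><info><author>anon</author></info></image>
  <image><title>Fashion Expo</title></image>
</collection>""")

# //image[./title about "squat"]  ~  keep images whose title mentions the term
hits = [img for img in doc.iter("image")
        if any("squat" in (t.text or "").lower() for t in img.findall("title"))]
print(len(hits))  # 1
```

The point of the language hierarchy above is that each level adds expressiveness: the substring test stands in for about() here, while scoring and distance predicates need a full-text engine.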


SLIDE 25

Overview

Preliminaries
Web in Practice
DB/IR in Theory
  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges


SLIDE 26

Retrieval Semantics

Structure search incorporates conditions on the underlying structure of a collection

Schemas help:
Schemas prescribe data and help validation
Provide limited description of valid instances

New semantics

Lowest Common Ancestor
Query relaxation
Overlapping elements


SLIDE 27

Lowest Common Ancestor

Retrieve most relevant fragment

  • References:

Nearest Concept Queries (Schmidt et al, ICDE 2002)

XRank (Guo et al, SIGMOD 2003)

Schema-Free XQuery (Li et al, VLDB 2004)
XKSearch (Xu &amp; Papakonstantinou, SIGMOD 2005)
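The idea shared by these systems can be sketched with Dewey-style labels, an assumed encoding in the spirit of XKSearch: the lowest common ancestor of two keyword matches is the longest common prefix of their labels, and the most specific such ancestor is the candidate answer fragment.

```python
# Sketch (assumed encoding): nodes carry Dewey labels, e.g. (0, 2, 1)
# means "second child of the third child of the root". The LCA of two
# matched nodes is the longest common prefix of their labels.
def lca(a, b):
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# A query keyword matching a paper title at (0, 2, 0, 0) and another
# matching an author at (0, 2, 0, 2) meet at the paper node (0, 2, 0):
# the paper is the most specific fragment containing both keywords.
print(lca((0, 2, 0, 0), (0, 2, 0, 2)))  # (0, 2, 0)
```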


SLIDE 28

XRank

&lt;workshop date="28 July 2000"&gt;
  &lt;title&gt; XML and Information Retrieval: A SIGIR 2000 Workshop &lt;/title&gt;
  &lt;editors&gt; David Carmel, Yoelle Maarek, Aya Soffer &lt;/editors&gt;
  &lt;proceedings&gt;
    &lt;paper id="1"&gt;
      &lt;title&gt; XQL and Proximal Nodes &lt;/title&gt;
      &lt;author&gt; Ricardo Baeza-Yates &lt;/author&gt;
      &lt;author&gt; Gonzalo Navarro &lt;/author&gt;
      &lt;abstract&gt; We consider the recently proposed language … &lt;/abstract&gt;
      &lt;section name="Introduction"&gt;
        Searching on structured text is becoming more important with XML …
        &lt;subsection name="Related Work"&gt; The XQL language … &lt;/subsection&gt;
      &lt;/section&gt;
      …
      &lt;cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"&gt; … &lt;/cite&gt;
    &lt;/paper&gt;
    …

(Guo et al, SIGMOD 2003)

SLIDE 29

XRank

(Same workshop document as on the previous slide.)

SLIDE 30

XIRQL

&lt;workshop date="28 July 2000"&gt;
  &lt;title&gt; XML and Information Retrieval: A SIGIR 2000 Workshop &lt;/title&gt;
  &lt;editors&gt; David Carmel, Yoelle Maarek, Aya Soffer &lt;/editors&gt;
  &lt;proceedings&gt;
    &lt;paper id="1"&gt;
      &lt;title&gt; XQL and Proximal Nodes &lt;/title&gt;
      &lt;author&gt; Ricardo Baeza-Yates &lt;/author&gt;
      &lt;author&gt; Gonzalo Navarro &lt;/author&gt;
      &lt;abstract&gt; We consider the recently proposed language … &lt;/abstract&gt;
      &lt;section name="Introduction"&gt;
        Searching on structured text is becoming more important with XML …
        &lt;em&gt; The XQL language &lt;/em&gt;

index nodes

      &lt;/section&gt;
      …
      &lt;cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"&gt; … &lt;/cite&gt;
    &lt;/paper&gt;
    …

(Fuhr & Großjohann, SIGIR 2001)

SLIDE 31

XML Query Relaxation

Query: image[title about “toilet”, info/author about “squat”]

Twig scoring: high quality, expensive computation
Path scoring
Binary scoring: low quality, fast computation

(figure: the query twig decomposed into root-to-leaf paths and into single-node predicates)

(Amer-Yahia et al, VLDB 2005)


SLIDE 32

XML Query Relaxation

Tree pattern relaxations:
Leaf node deletion
Edge generalization
Subtree promotion

(figure: the query twig image[title about “toilet”, info/author about “squat”] and its relaxed variants matched against the data)

(Amer-Yahia, SIGMOD 2004) (Schlieder, EDBT 2002) (Delobel &amp; Rousset, 2002)
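A minimal sketch of the first relaxation (leaf node deletion): the nested-dict encoding of tree patterns and the node names are assumptions made for illustration, following the slide's example query.

```python
# Sketch of one relaxation step (leaf node deletion) on a tree
# pattern encoded as nested dicts (name -> children); the encoding
# is an assumption, not the papers' actual representation.
def leaf_deletions(pattern):
    """Yield copies of the pattern with exactly one leaf removed."""
    for name, children in pattern.items():
        if not children:                                  # a leaf node
            yield {k: v for k, v in pattern.items() if k != name}
        else:                                             # recurse into subtree
            for relaxed in leaf_deletions(children):
                yield {**pattern, name: relaxed}

query = {"image": {"title:toilet": {}, "info": {"author:squat": {}}}}
for r in leaf_deletions(query):
    print(r)
```

Each relaxed pattern matches a superset of the original answers, which is why relaxation pairs naturally with scoring: exact matches rank above matches of the relaxed variants.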


SLIDE 33

Controlling Overlap

What most approaches are doing:

  • Given a ranked list of elements:
  • 1. select element with the highest score within a path
  • 2. discard all ancestors and descendants
  • 3. go to step 1 until all elements have been dealt with

  • (Also referred to as brute-force filtering)
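The three steps can be sketched as follows, with elements identified by (score, path) pairs and ancestry tested as a path-prefix relation; the tuple-of-child-offsets encoding is an assumption for illustration.

```python
# Minimal sketch of brute-force filtering; an element's path is a
# tuple of child offsets (assumed encoding), so ancestor/descendant
# overlap means one path is a prefix of the other.
def overlaps(p, q):
    shorter, longer = sorted((p, q), key=len)
    return longer[:len(shorter)] == shorter

def brute_force_filter(ranked):
    kept = []
    for score, path in sorted(ranked, key=lambda e: -e[0]):  # best first
        if all(not overlaps(path, k) for _, k in kept):      # step 2
            kept.append((score, path))                       # step 1
    return kept

ranked = [(0.9, (0, 1)),
          (0.8, (0, 1, 2)),   # descendant of the first element: discarded
          (0.7, (0, 2))]
print(brute_force_filter(ranked))  # [(0.9, (0, 1)), (0.7, (0, 2))]
```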


SLIDE 34

Post-Processing Overlap

Sometimes with some “prior” processing to affect ranking:

Use of a utility function that captures the amount of useful information in an element

Element score * Element size * Amount of relevant information

Used as a prior probability
Then apply “brute-force” overlap removal

(Mihajlovic et al, INEX 2005; Ramirez et al, FQAS 2006)


SLIDE 35

Post-Processing Overlap

  • Scores of elements containing or contained within higher-ranking components are iteratively adjusted (depends on the amount of overlap “allowed”)

1. Select the highest ranking component.
2. Adjust the retrieval status value of the other components.
3. Repeat steps 1 and 2 until the top m components have been selected.
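A sketch of this iterative adjustment, where the amount of overlap "allowed" is controlled by a damping factor alpha; the factor and the path encoding are assumptions for illustration, and alpha = 0 degenerates to brute-force removal.

```python
# Sketch of iterative overlap re-ranking; elements map an assumed
# path encoding (tuple of child offsets) to a retrieval status value,
# and overlapping scores are damped by alpha after each selection.
def rerank(elements, m, alpha=0.5):
    pool = dict(elements)                       # path -> retrieval status value
    selected = []
    while pool and len(selected) < m:
        best = max(pool, key=pool.get)          # step 1: highest-ranking component
        selected.append((best, pool.pop(best)))
        for path in pool:                       # step 2: adjust overlapping RSVs
            if path[:len(best)] == best or best[:len(path)] == path:
                pool[path] *= alpha
    return selected                             # step 3: loop until top m selected

elems = {(0,): 0.6, (0, 1): 0.9, (2,): 0.5}
print(rerank(elems, m=2))
```

After the first pick (0, 1), its ancestor (0,) is damped from 0.6 to 0.3, so the non-overlapping element (2,) wins the second slot.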


(Clarke, SIGIR 2005)

SLIDE 36

Post-Processing Overlap

Smart filtering: given a list of ranked elements
  • group elements per article
  • build a result tree
  • “score grouping”: for each element N1
  • 1. score N2 > score N1
  • 2. concentration of good elements
  • 3. even distribution of good elements

(figure: result trees for cases 1-3 with elements N1 and N2)

(Mass &amp; Mandelbrod, INEX 2005)


SLIDE 37

Languages

Keyword search (CO Queries)

“xml”

Tag + Keyword search

book: xml

Path Expression + Keyword search (CAS Queries)

/book[./title about “xml db”]

XQuery + Complex full-text search

for $b in /book
let score $s := $b ftcontains “xml” && “db” distance 5


SLIDE 38

Overview

Preliminaries
Web in Practice
DB/IR in Theory
  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges


SLIDE 39

Encodings, Summaries, Indexes
Access Methods


SLIDE 40

Stack Algorithms

Region algebra encoding

Elements [DocID, Element, Start, End, LevelNum]
Values [DocID, Value, Start, LevelNum]
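The point of this encoding is that structural relationships reduce to integer comparisons; the ancestor/descendant test that stack-based structural joins build on can be sketched as:

```python
# With the region encoding above, element a contains element b iff
# they are in the same document and a's (Start, End) interval
# strictly encloses b's. Example values below are made up.
from collections import namedtuple

Elem = namedtuple("Elem", "doc tag start end level")

def contains(a, b):
    return a.doc == b.doc and a.start < b.start and b.end < a.end

workshop = Elem(1, "workshop", 1, 100, 1)
title = Elem(1, "title", 2, 5, 2)
print(contains(workshop, title))  # True
print(contains(title, workshop))  # False
```

LevelNum additionally distinguishes parent/child from general ancestor/descendant (the levels differ by exactly one).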

SLIDE 41

Structural Summaries

XML structural summaries are graphs representing relationships between sets in a partition of XML elements.

Many proposals

Region inclusion graphs (RIGs) [CM94], representative objects (ROs) [NUWC97], dataguides [GW97], 1-index, 2-index and T-index [MS99], ToXin [RM01], XSKETCH [PG02], APEX [CMS02], A(k)-index [KSBG02], F+B-Index and F&amp;B-Index [KBNK02], D(k)-index [QLO03], M(k)-index [HY04], Skeleton [BCFH+05], XCLUSTER [PG06]

AxPRE (axis path regular expression) summaries answer:
How are all these summaries related?
Can they be constructed together?
Can they be used [for query evaluation] together?

SLIDE 42

Query Processing

for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
let $s := $z/address ftscore "Toronto" && "Ontario"
where $x/part_no = $y/part_no and $z/supplier_no = $x/supplier_no
order by $s
return &lt;result score="$s"&gt; {$x/part_no} {$x/price} {$y/description} &lt;/result&gt;


SLIDE 43

Retrieval models


SLIDE 44

Score Combination

(figure: score-combination pipeline: inverted files feed rankings computed under retrieval models such as BM25, SLM, and DFR; weighted queries and article-level rankings are merged with combination functions such as Sum, Max, MinMax, and Z)

SLIDE 45

Preliminaries for Top-k Retrieval

Each object is scored using different criteria
Score (or grade) is a value, usually in [0,1]
A criterion (e.g., a keyword) refers to attributes or keywords specified in the query
Each criterion has a sorted list of R(objects, score)
The combined score is computed using an aggregation function t(x1, x2, …, xm)
If xi ≤ x′i for every i, then t(x1, x2, …, xm) ≤ t(x′1, x′2, …, x′m)
Examples: average, weighted sum, min, max, etc.

Goal: merge ranked results to find the best top-k answers


SLIDE 46

Threshold Algorithm (TA) [FLN’01]

Sorted access in parallel to each of the m lists
Random access for every new object seen in every other list to find the i-th field xi of R
Use aggregation function t(R) = t(x1, x2, …, xm) to calculate the grade and store it in set Y only if it belongs to the current top-k objects
Calculate threshold value T = t(x1, x2, …, xm) of the aggregation function after every sorted access; stop when k objects have grade at least T

Return set Y, which has the top-k values

Analysis: TA is optimal over every instance
but… big O, and don’t forget assumptions
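A compact sketch of TA with m = 2 lists: random access is simulated with dictionaries, the aggregation function t is sum (any monotone t works), and the example scores are made up. The sketch assumes every object appears in every list.

```python
# Sketch of the Threshold Algorithm over m score-sorted lists.
import heapq

def threshold_algorithm(lists, k, t=sum):
    """lists: m lists of (object, score), each sorted by descending score."""
    random_access = [dict(l) for l in lists]   # object -> score, per list
    seen = {}                                  # object -> aggregated grade
    top = []
    for row in zip(*lists):                    # one sorted access per list
        T = t(score for _, score in row)       # threshold at this depth
        for obj, _ in row:
            if obj not in seen:                # random access to the other lists
                seen[obj] = t(ra[obj] for ra in random_access)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and all(grade >= T for _, grade in top):
            break                              # k objects at or above T: stop
    return top

l1 = [("a", 1.0), ("b", 0.5), ("c", 0.25)]
l2 = [("b", 1.0), ("a", 0.25), ("c", 0.125)]
print(threshold_algorithm([l1, l2], k=1))  # [('b', 1.5)]
```

Note the early stop: the algorithm halts at depth 2 with the threshold at 0.75, so object "c" is never aggregated; that saved work is exactly where TA's instance optimality comes from.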


SLIDE 47

Variations of TA

NRA: When no random access (RA) is possible
Example: Web search engines, which typically do not allow you to enter a URL and get its ranking

TAZ: When no sorted access (SA) is possible for some predicates
Example: Find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)

CA: When the relative costs of random and sorted accesses matter (TA+NRA).

TAθ: When only approximate answers are needed
Example: Web search, with lots of good quality answers
SA/RA scheduling problem, IO-Top-K [BMSTW’06]


SLIDE 48

VisTopK Demo


SLIDE 49

Overview

Preliminaries
Web in Practice
DB/IR in Theory
  Query Languages
  Retrieval Semantics

  Evaluation à la DB (Query Processing)
  Evaluation à la DB (Relevance Assessments)

Challenges


SLIDE 50

Evaluation of XML retrieval: INEX

Evaluating the effectiveness of content-oriented XML retrieval approaches, like TREC

Collaborative effort ⇒ participants contribute to the development of the collection (IEEE and Wikipedia), queries, relevance assessments, methodology

Content-only (CO) topics

Ignore document structure

Content-and-structure (CAS) topics

Contain conditions referring both to content and structure of the sought elements
Conditions may or may not be strict


SLIDE 51

CAS topics 2003-2004

&lt;title&gt; //article[(./fm//yr = '2000' OR ./fm//yr = '1999') AND about(., '"intelligent transportation system"')]//sec[about(.,'automation +vehicle')] &lt;/title&gt;
&lt;description&gt; Automated vehicle applications in articles from 1999 or 2000 about intelligent transportation systems. &lt;/description&gt;
&lt;narrative&gt; To be relevant, the target component must be from an article on intelligent transportation systems published in 1999 or 2000 and must include a section which discusses automated vehicle applications, proposed or implemented, in an intelligent transportation system. &lt;/narrative&gt;


SLIDE 52

Relevance in XML retrieval

A document is relevant if it “has significant and demonstrable bearing on the matter at hand”.

Common assumptions in laboratory experimentation:
− Objectivity
− Topicality
− Binary nature
− Independence

(figure: XML retrieval evaluation over nested article elements, with assessments ss1 and ss2) (Borlund, JASIST 2003) (Goevert et al, JIR 2006)


SLIDE 53

Relevance in XML retrieval: INEX 2003-2004

Topicality not enough
Binary nature not enough
Independence is wrong

  • Relevance = (0,0) (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3)
exhaustivity = how much the section discusses the query: 0, 1, 2, 3
specificity = how focused the section is on the query: 0, 1, 2, 3
  • If a subsection is relevant, so must be its enclosing section, ...


SLIDE 54

Specificity Dimension 2005

Continuous scale defined as the ratio (in characters) of the highlighted text to element size
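Read literally, the 2005 specificity dimension is just a character ratio; the numbers below are made-up example values.

```python
# Specificity as defined above: highlighted (relevant) characters
# divided by the element's total size in characters.
def specificity(highlighted_chars, element_chars):
    return highlighted_chars / element_chars if element_chars else 0.0

# e.g. 300 of a section's 1200 characters were highlighted by assessors
print(specificity(300, 1200))  # 0.25
```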


SLIDE 55

Measuring effectiveness: Metrics

− Inex_eval (also known as inex2002) (Goevert &amp; Kazai, INEX 2002)
  official INEX metric 2002-2004
− Inex_eval_ng (also known as inex2003) (Goevert et al, JIR 2006)
− ERR (expected ratio of relevant units) (Piwowarski &amp; Gallinari, INEX 2003)
− xCG (XML cumulative gain) (Kazai &amp; Lalmas, TOIS 2006)
  official INEX metric 2005-
− t2i (tolerance to irrelevance) (de Vries et al, RIAO 2004)
− EPRUM (Expected Precision Recall with User Modelling) (Piwowarski &amp; Dupret, SIGIR 2006)
− HiXEval (Highlighting XML Retrieval Evaluation) (Pehcevski &amp; Thom, INEX 2005)
  official INEX metric 2007
− Structural Relevance (Ali, Consens &amp; Lalmas, SIGIR Element Retrieval Workshop 2007)


SLIDE 56

Overview

Preliminaries
Web in Practice
DB/IR in Theory
Challenges


SLIDE 57

Challenges

In practice, user interfaces are key

Combine sources of information
Provide feedback on retrieval results

Interaction between traditional DB query optimization and ranking/top-k

What are the useful extensions to keyword querying that incorporate structural information?


SLIDE 58

Challenges

Indexing, Searching, Ranking

Efficient (and Effective) algorithms

INEX-like test collection and effectiveness

Too complex?
What constitutes a retrieval baseline?
What is a good measure?
Generalisation of the results on other data sets

Quality evaluation (Web, XML)

Who are the users?
What are their information needs?
What are the requirements?


SLIDE 59

Challenges Ahead

Lots of opportunities

To understand the structure of data
To exploit structure in searches
To measure and improve search quality

Can search remain a joy to use when users are allowed to

Contribute content? (Wikipedia)
Share it? (Flickr)
Rate it? (YouTube)
