CS490W Semi-Structured Data Structure of XML XML data is organized - - PDF document



SLIDE 1

CS490W

Luo Si

Department of Computer Science Purdue University

XML Data and Retrieval

XML and Retrieval: Outline

Semi-Structured Data

  • XML, Examples, Applications

XML Search

  • XQuery
  • XIRQL

Text-Based XML Retrieval

  • Vector-space model
  • INEX

Semi-Structured Data

XML has been used as the standard representation of semi-structured data

eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.

  • A framework for defining markup languages
  • Open vocabulary for tags
  • Each XML vocabulary (set of tags) corresponds to a different application
  • Facilitates the sharing of data across different information systems, particularly systems connected via the Internet

Examples: RSS, XHTML, MathML

Semi-Structured Data

Structure of XML

XML data is organized by documents, like unstructured data

There are structures (nodes/tags) within the documents

Each XML document is an ordered, labeled tree

Element nodes are labeled with

  • Node name (e.g., chapter)
  • Node attributes and their values (e.g., size=1000; time=01/01/2007)
  • May have child nodes or data

Data (e.g., text strings) exists within leaf nodes

XML Example

<book id="ML_Tom">
  <title>Machine Learning</title>
  <author>
    <firstname>Tom</firstname>
    <surname>Mitchell</surname>
  </author>
  ...
  <p>Machine Learning Applications...</p>
  ...
</book>

Elements, Attributes/Values, Data (Text String)

[Figure: tree view of the document, with nodes book, title, author, firstname, surname, chapter, title, para]
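The ordered, labeled tree view can be made concrete with a short sketch. It assumes Python's standard `xml.etree.ElementTree` module and a trimmed, well-formed copy of the book document (the `...` placeholders are dropped); the helper name `describe` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the book example (placeholders removed).
BOOK_XML = """
<book id="ML_Tom">
  <title>Machine Learning</title>
  <author>
    <firstname>Tom</firstname>
    <surname>Mitchell</surname>
  </author>
  <p>Machine Learning Applications...</p>
</book>
"""

def describe(node, depth=0):
    """Return one line per element: name, attributes, and its own text."""
    text = (node.text or "").strip()
    lines = [f"{'  ' * depth}{node.tag} attrs={node.attrib} text={text!r}"]
    for child in node:  # children are visited in document order
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(BOOK_XML)
tree_lines = describe(root)
print("\n".join(tree_lines))
```

Each printed line corresponds to one element node; the text strings only show up on the leaf elements, matching the slide on XML structure.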

SLIDE 2

Elements

Elements are defined by markup tags

Elements: <TagName attr_a="value"…>text</TagName>

  • The element is identified by its tag name: TagName
  • Attribute: attr_a; value: "value"
  • Data/text: "text"
  • End tag: </TagName>

XML, HTML, SGML

  • 1986: SGML ISO 8879-1986
  • Nov 1995: HTML 2.0
  • Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
  • Jan 1997: HTML 3.2
  • Aug 1997: XML working draft
  • Dec 1997: XML 1.0 proposed recommendation
  • Feb 1998: XML 1.0 (W3C Recommendation)
  • Feb 1999: XHTML

XML and HTML

  • Both are derived from SGML
  • HTML is a markup language mainly for display in browsers
  • XML is a framework for markup languages
  • HTML defines display
  • XML defines the data structure; the display aspect is separated from the content
  • HTML can be formalized as XML (XHTML)

Why XML?

Unlike a relational database, XML data does not require relational schemata, etc., because the data itself carries this structural information (it is self-describing).

Unlike HTML, the widely used Web format, which only governs how formatted data is presented, XML also describes what the data is, so that other applications can reuse it.

XML Applications

  • CML – chemical markup language
  • WML – wireless markup language
  • ThML – theological markup language

XML Applications

CML – chemical markup language:

CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.

<molecule convention="MDLMol" id="baclofen" title="BACLOFEN">

SLIDE 3

XML Applications

WML – wireless markup language

Wireless Markup Language is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN" "http://www.phone.com/dtd/wml11.dtd" >
<wml>
  <card id="main" title="First Card">
    <p mode="wrap">This is a sample WML page.</p>
  </card>
</wml>

XML Applications

ThML – theological markup language

<ThML>
  <ThML.body>
    <div1>
      <div2 title="Genesis" id="Gen">
        <div3 title="Chapter 1">
          <p>
            <scripture/>
            In the beginning God created the heaven and the earth.
            <scripture/>
            And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
          </p>
        </div3>
      </div2>
    </div1>
  </ThML.body>
</ThML>

XML Files

Schema/DTD: syntax definition of an XML language

Document Type Definition (DTD file)

XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.

DTD Example:

<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>

XML Files

DTD Example:

<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>

XML Document:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
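To illustrate what the `note` content model requires, here is a minimal sketch. Python's standard `xml.etree.ElementTree` does not validate against a DTD, so the rule `(to,from,heading,body)` is transcribed by hand into a list; the names `NOTE_CHILDREN` and `follows_note_model` are illustrative assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

# Hand-written copy of <!ELEMENT note (to,from,heading,body)>.
NOTE_CHILDREN = ["to", "from", "heading", "body"]

def follows_note_model(xml_text):
    """True if the root is <note>, its children appear exactly in the
    declared order, and every child holds only character data (#PCDATA)."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    if [child.tag for child in root] != NOTE_CHILDREN:
        return False
    return all(len(child) == 0 for child in root)

print(follows_note_model(NOTE_XML))
```

A real validator (for example one built on a validating parser) would read the DTD itself; this check only mirrors the one rule shown on the slide.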

XML Files

XML Schema:

Recommended by the W3C as the successor of DTDs, more informally referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="country" type="Country"/>
  <xs:complexType name="Country">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="population" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

XML Search

Most XML search protocols use a database-style approach

  • Non-text data match
  • Exact keyword (text) match
  • Evaluate XML path expressions
  • No concept of relevance

SLIDE 4

XML Search

Traditional XML search: the database-based approach

  • XQuery
  • Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)

XML text search: the information retrieval approach

  • XIRQL
  • Vector-space based
  • Searches text data: estimates the relevance of XML elements with respect to the query
  • Query may contain path expressions

XML Search

XQuery

  • SQL for XML
  • Used for text-rich documents, data-oriented (non-text) documents, and mixed documents
  • Considers path expressions (XPath) and XML Schema datatypes
  • It is still a working draft; details are being improved

XML Search

XQuery considers some principal forms

  • Path expressions
  • Conditional expressions
  • Datatype expressions
  • List expressions, etc.

Programming language: FLWOR ("flower") expressions

Principal forms can be evaluated with respect to a context

Principal Forms

Path query

  /book//title contains "Information Retrieval"
  The title of the book contains the keywords "Information Retrieval"

Conditional expressions

  $h/title, IF $h/@type = "Journal" THEN ….
  If the type of an article is journal
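The two principal forms above can be imitated with the limited XPath subset in Python's standard `xml.etree.ElementTree`. This is only a sketch: the sample `<library>` document and the helper `titles_containing` are made up, and the substring test stands in for the exact keyword match that XQuery performs.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection with two books.
LIBRARY_XML = """
<library>
  <book type="Textbook">
    <title>Introduction to Information Retrieval</title>
  </book>
  <book type="Journal">
    <chapter><title>Machine Learning</title></chapter>
  </book>
</library>
"""

def titles_containing(root, phrase):
    """Emulate /book//title contains "phrase": titles at any depth below a book."""
    hits = []
    for book in root.findall("book"):
        for title in book.findall(".//title"):  # './/' means any depth below book
            if phrase.lower() in (title.text or "").lower():
                hits.append(title.text)
    return hits

root = ET.fromstring(LIBRARY_XML)
ir_titles = titles_containing(root, "Information Retrieval")

# Conditional form: only act on books whose type attribute is "Journal".
journal_titles = [
    title.text
    for book in root.findall("book[@type='Journal']")
    for title in book.findall(".//title")
]
print(ir_titles, journal_titles)
```

Note that the match is Boolean: a title either contains the phrase or it does not, with no ranking by relevance.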

FLWOR (FLWR)

Programming language: FLWOR (FLWR) expression

XQuery defines FLWOR or FLWR (often pronounced 'flower') as an expression that supports iteration and the binding of variables to intermediate results.

  • for and let create a sequence of tuples
  • where filters the tuples on a boolean expression
  • order by sorts the tuples, using any comparable data
  • return gets evaluated once for every tuple

FLWOR (FLWR)

for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//employee[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
  <big-dept>
    { $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal> }
  </big-dept>
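For readers more familiar with a general-purpose language, the clauses of the query above can be read as the following Python sketch. The in-memory lists are hypothetical stand-ins for `depts.xml` and `emps.xml`, and `big_departments` is an illustrative name; it mirrors the for / let / where / order by / return steps rather than any real XQuery engine.

```python
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml.
deptnos = ["D1", "D2"]
employees = (
    [{"deptno": "D1", "salary": 50_000 + 1_000 * i} for i in range(10)]
    + [{"deptno": "D2", "salary": 90_000} for _ in range(3)]
)

def big_departments(deptnos, employees, min_headcount=10):
    tuples = []
    for d in deptnos:                                          # for $d in ...
        e = [emp for emp in employees if emp["deptno"] == d]   # let $e := ...
        if len(e) >= min_headcount:                            # where count($e) >= 10
            tuples.append((d, e))
    # order by avg($e/salary) descending
    tuples.sort(key=lambda t: mean(emp["salary"] for emp in t[1]), reverse=True)
    # return: evaluated once for every surviving tuple
    return [
        {"deptno": d, "headcount": len(e),
         "avgsal": mean(emp["salary"] for emp in e)}
        for d, e in tuples
    ]

result = big_departments(deptnos, employees)
print(result)
```

Only department D1 survives the `where` filter here, because D2 has three employees.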

SLIDE 5

XML Search

XQuery considers some principal forms and combines them with FLWOR (FLWR) expressions

It is quite similar to SQL for relational databases

However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).

  • Find a book about information retrieval
  • Find a book which is about $30.

XML IR Challenges 1: Term Statistics

There are multiple types of elements (books, titles, abstracts); how do we construct the corpus statistics (idf) for different elements?

How do we handle the term frequency information?

  • Example: /book//title "information retrieval": do we also consider the book abstract?
  • Hierarchical smoothing
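One possible answer to the idf question is to keep separate statistics for each element type. The sketch below assumes a toy collection in which each document is flattened to a mapping from element name to text; the data and the function name `element_idf` are hypothetical, and the plain log(N/df) formula is one common idf variant rather than the one prescribed by the slides.

```python
import math
from collections import defaultdict

# Hypothetical toy collection: each document maps element type -> text.
docs = [
    {"title": "information retrieval", "abstract": "ranking text documents"},
    {"title": "machine learning", "abstract": "learning for information retrieval"},
    {"title": "database systems", "abstract": "information storage and query"},
]

def element_idf(docs):
    """idf[(element, term)] = log(N_e / df_e(term)), counted per element type."""
    n_elem = defaultdict(int)  # number of elements of each type
    df = defaultdict(int)      # in how many elements of that type the term occurs
    for doc in docs:
        for elem, text in doc.items():
            n_elem[elem] += 1
            for term in set(text.split()):
                df[(elem, term)] += 1
    return {key: math.log(n_elem[key[0]] / count) for key, count in df.items()}

idf = element_idf(docs)
print(idf[("title", "information")], idf[("abstract", "information")])
```

The same word gets a higher idf in titles (one of three) than in abstracts (two of three), which a single corpus-wide statistic would hide.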

XML IR Challenges 2: Schemas

Ideal Case

  • There is a universal schema
  • Users can associate data types with the universal schema without ambiguity
  • Too ideal to be true…

Real World

  • There are many schemas, with different spellings, different concepts, and different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User interface

How to guide the user to find relevant elements

  • Granularity control: Book->Abstract->Full Text

What type of query language

  • Natural language query (IR approach): most usable
  • With structure information: more powerful but less usable

How to do query expansion

  • How to automatically add structure information, e.g., find a book written by J. K. Rowling -> find a book written by /../author (J. K. Rowling, )
  • Open research problem

XIRQL

Prof. Norbert Fuhr, University of Dortmund: open source XML search engine

XIRQL: a query language for information retrieval in XML documents

Structured Document Retrieval Principle

Users may not know the schema

  • Allow users to search even if they do not know the schema of the data

Units

  • Only atomic units can be returned
  • Traditional IR treats documents as atomic units; XML retrieval works on a tree-like view of documents

XIRQL only indexes and returns atomic units

  • Atomic units can be leaf nodes that contain text information
  • Atomic units can be other internal nodes
  • Atomic units can be defined in the DTD
  • TF-IDF values are calculated based on atomic units

SLIDE 6

XIRQL Atomic Units

Structured Document Retrieval Principle

We should always rank the most specific/probable atomic units for answering a query.

Example query: xql

Document (the numbers are term weights):

<chapter>
  0.3 XQL
  <section> 0.5 example </section>
  <section> 0.8 XQL 0.7 syntax </section>
</chapter>

Return section, not chapter
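A small sketch shows how the section can outrank the chapter for the query "xql". It makes two assumptions that are not on the slide: evidence from a child element is scaled by an augmentation factor (`AUGMENT = 0.6`, a made-up value) before it counts for the parent, and weights are combined as independent probabilities. The function name `score` and the node labels are illustrative.

```python
AUGMENT = 0.6  # assumed factor for evidence propagated from child to parent

# (name, own term weights, children), following the example document.
chapter = ("chapter", {"xql": 0.3}, [
    ("section[1]", {"example": 0.5}, []),
    ("section[2]", {"xql": 0.8, "syntax": 0.7}, []),
])

def score(node, term, out):
    """Score `term` for `node` and all descendants; results are stored in `out`."""
    name, weights, children = node
    p = weights.get(term, 0.0)
    for child in children:
        q = AUGMENT * score(child, term, out)
        p = p + q - p * q  # combine as independent events
    out[name] = p
    return p

scores = {}
score(chapter, "xql", scores)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Under these assumptions the second section scores 0.8 and the chapter about 0.636, so the more specific unit is ranked first.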

Structured Document Retrieval Principle

Data types

  • XIRQL suggests vague predicates for different kinds of data types (e.g., person names, locations, dates).
  • It suggests datatype-specific comparison operators (e.g., 'near', <, >, 'broader', 'narrower', …)

Semantic roles

  • When searching for #persname, XIRQL searches all persons in documents, without specifying their role and regardless of their position in the XML document tree

XIRQL Summary

  • Relevance ranking with respect to the structured document retrieval principle
  • Recommends datatype-specific operators for different types of data
  • Enables semantic roles

Text-Based XML Retrieval

Documents are marked up with XML tags

  • journal articles, conference papers, novels, manuals…

Queries

  • plain text queries, queries with structure (keywords in the title or abstract)

Results

  • The system automatically adjusts the granularity of the returned results (e.g., the most specific section about "the role of p53 gene for cancer")
  • Considers both coverage and specificity

Vector Space Model and XML

Vector space model for traditional IR

  • Represents queries and plain documents by vectors in the keyword space.
  • Does not distinguish the keywords in different fields (e.g., title or full text).
  • Calculates similarities between vectors

Vector space in XML data

Need to capture the structure of an XML document in the vector space.

SLIDE 7

Vector Space Model and XML

Flexible queries for XML retrieval

Content Only queries (CO)

  • Information need expressed as plain text queries, similar to those in traditional information retrieval

Content and Structure queries (CAS)

  • Information need expressed with plain text and structure information
  • /book//title "Bill Gates" or /book//author "Bill Gates"
  • The structure information can be strict or flexible (i.e., results must come from some elements, or are preferred from some elements)

Tree Representation of Queries

[Figure: two query trees. /book//author "Bill Gates": Book -> Author -> "Bill Gates". /book "Bill Gates": Book -> "Bill Gates".]

Vector Space Model and XML

[Figure: two book trees. Book 1: Title "Software", Author "Bill Gates". Book 2: Title "The plot to get Bill Gates", Author "Gary Rivlin".]

Vector Space Model and XML

Vector space model for traditional IR

  • The system treats all keywords in a document equally, so the two occurrences of "Gates" look the same in the two documents

Vector space in XML data

  • We must distinguish the two occurrences of "Gates" under the different elements "Title" and "Author"
  • The index must consider both the contents and the locations of keywords (e.g., different elements)
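The difference between the two views can be seen in a few lines. This sketch flattens each of the two books to a mapping from element name to lower-cased text (an assumption made for brevity) and compares a plain bag of words with a bag of (element, keyword) pairs; the function names are illustrative.

```python
from collections import Counter

# Hypothetical flattened view of the two books: element name -> text.
book1 = {"title": "software", "author": "bill gates"}
book2 = {"title": "the plot to get bill gates", "author": "gary rivlin"}

def flat_terms(doc):
    """Traditional IR: one bag of words, element boundaries ignored."""
    return Counter(w for text in doc.values() for w in text.split())

def structured_terms(doc):
    """XML view: each dimension is an (element, keyword) pair."""
    return Counter((elem, w) for elem, text in doc.items() for w in text.split())

flat1, flat2 = flat_terms(book1), flat_terms(book2)
struct1, struct2 = structured_terms(book1), structured_terms(book2)

print(flat1["gates"], flat2["gates"])                                  # same count
print(struct1[("author", "gates")], struct2[("author", "gates")])      # differ
```

In the flat view both books match "gates" equally; in the structured view only the first book has "gates" under the author element.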

Vector Space Model and XML

Vector space in XML data

  • The index must consider both the contents and the locations of keywords (e.g., different elements)
  • To accomplish this, we need to consider the partial trees (structural items) within an XML document.
  • Can we build indexes for the structural items (partial trees)?

Vector Space Model and XML

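[Figure: book tree. Book -> Title -> "Software"; Book -> Author -> "Bill Gates".]

If we do not allow gaps in the tree structures, we obtain the following structural items (partial trees):

[Figure: structural items drawn as small trees: Bill; Software; Gates; Title-Software; Author-Bill; Author-Gates; Book-Title-Software; Book-Author-Bill; Book-Author-Gates.]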
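Gap-free structural items of this kind can be enumerated mechanically. The sketch below assumes the book tree is given as nested `(label, children)` tuples in which a leaf's label is the element text, and it writes each item as a slash-separated path ending in a keyword; both conventions, and the name `structural_items`, are choices made for illustration.

```python
# Tree as (label, children); a leaf's label holds the text of that element.
book = ("Book", [
    ("Title", [("Software", [])]),
    ("Author", [("Bill Gates", [])]),
])

def structural_items(node, ancestors=()):
    """Yield each keyword together with every contiguous tail of its ancestor path."""
    label, children = node
    if not children:  # leaf: the label is text, split it into keywords
        for term in label.split():
            for start in range(len(ancestors), -1, -1):
                yield "/".join(ancestors[start:] + (term,))
        return
    for child in children:
        yield from structural_items(child, ancestors + (label,))

items = list(structural_items(book))
print(items)
```

Because only contiguous tails of the path are generated, an item such as Book/Software (which skips Title) never appears. Even this tiny tree yields nine items, which hints at how quickly the vocabulary grows.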

SLIDE 8

Vector Space Model and XML

Problems of Indexing with Structural items

  • The number of distinct structural items can be very large.
  • It is not practical to build and store a vector space index with so many dimensions

Some possible solutions

  • Build a query-time partial vector space
  • Restrict the structural items to a manageable set

Vector Space Model and XML

Query-time partial vector space

Instead of generating all structural items at one time, we can generate only the partial vector space needed for a specific query (a much smaller set)

For a specific query

  • We seek all XML documents that contain any keyword of the query, and build the partial vector space from these XML documents
  • The similarity between the qualifying XML documents and the query can be calculated within the partial vector space

Vector Space Model and XML

Weights of Structural items (partial trees)

Down-weighting for structural items

[Figure: book tree with nodes Book, Title, Full Text, P1, P2 and the text "Software" and "Windows platform, linux…"]

"Software" should have more influence (weight) for the book element than "Windows", "Platform", ….

Calculate the weight of a term for an element k levels up by applying a scaling factor β^k, 0 < β < 1
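As a worked example of the β^k scaling, the sketch below assumes one reading of the figure: "Software" sits directly in Title (one level below Book) and "Windows" sits in paragraph P2 under Full Text (two levels below Book). The value `BETA = 0.5` and the names `occurrences` and `weight_at` are assumptions; the slides only require 0 < β < 1.

```python
BETA = 0.5  # assumed scaling factor; any value with 0 < beta < 1 would do

# Path from the element containing the term up to the root, nearest first.
# This layout is one reading of the figure, not stated explicitly in the slides.
occurrences = {
    "software": ["Title", "Book"],
    "windows": ["P2", "Full Text", "Book"],
}

def weight_at(term, element, tf=1.0, beta=BETA):
    """Weight of `term` for `element`, which lies k levels above the term."""
    k = occurrences[term].index(element)  # k = 0 for the containing element
    return tf * beta ** k

print(weight_at("software", "Book"), weight_at("windows", "Book"))
```

With these numbers "software" contributes 0.5 to the book element and "windows" only 0.25, so terms closer to an element influence it more.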


Vector Space Model and XML

Weights of Structural items (partial trees)

Down-weighting for structural items

[Figure: the same book tree (Book, Title, Full Text, P1, P2), annotated with the weights 0.8 and 0.2]

Weights can also be set for different partial trees.

  • The weights can be predefined
  • Weights can be application-oriented
  • Weights can be user-specific
  • Weights can be query-specific
  • Learning issues…

Vector Space Model and XML

Other issues of Weights of Structural items (partial trees)

[Figure: book tree with nodes Book, Title, Full Text, P1, P2 and the text "Software" and "Windows platform, linux…"]

Down-weighting uses the contents of low-level elements for high-level elements (e.g., the contents of "title" and "full text" for "book").

Should we also incorporate the contents of high-level (or same-level) elements for low-level elements? The smoothing strategy…

SLIDE 9

Vector Space Model and XML

Calculating the similarity

Vocabulary mismatch of keywords and structures

  • Keyword mismatch has been studied in traditional information retrieval; we can utilize techniques such as query expansion, latent semantic indexing, probabilistic semantic indexing, …
  • Structure mismatch

[Figure: three structural items. Book -> Software; Book -> Title -> Software; Book -> Full Text -> Software.]

Vector Space Model and XML

Calculating the similarity

  • First find all structural items in the query
  • Find all similar matches against the vocabulary of structural items
  • It is not a Boolean match, but a similarity match (e.g., 0.9 similarity score with an item)
  • Retrieve all documents/elements with that structural item, compute the cosine similarity, etc.
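One way to make the similarity match concrete is the context resemblance measure described in Manning et al., Introduction to Information Retrieval (one of the resources listed at the end of these slides): a query path matches a document path if it can be turned into it by inserting nodes, and the score is (1 + |query path|) / (1 + |document path|). The sketch below swaps in that measure as an assumption; the slides do not say which similarity function is used, and the data layout is made up.

```python
import math

def context_resemblance(cq, cd):
    """(1+|cq|)/(1+|cd|) if cq can be turned into cd by inserting nodes, else 0."""
    it = iter(cd)
    if all(label in it for label in cq):  # cq is an ordered subsequence of cd
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

def similarity(query, doc):
    """query/doc: {(path_tuple, term): weight}. Soft structural match,
    normalized by the length of the document vector."""
    total = 0.0
    for (cq, tq), wq in query.items():
        for (cd, td), wd in doc.items():
            if tq == td:
                total += context_resemblance(cq, cd) * wq * wd
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return total / norm if norm else 0.0

query = {(("Book",), "software"): 1.0}
doc_title = {(("Book", "Title"), "software"): 1.0}
doc_deep = {(("Book", "Full Text", "P1"), "software"): 1.0}
print(similarity(query, doc_title), similarity(query, doc_deep))
```

The query item Book/software matches both documents, but the one where "software" sits closer to Book gets the higher score, which is the non-Boolean behaviour the slide asks for.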

Vector Space Model and XML

Problems with the vector space model

What IDF value?

  • We cannot use a corpus-wide IDF value. The IDF value should be element-specific.
  • But do we need to incorporate the IDF factor of high-level or same-level elements?

For heterogeneous XML documents

  • We do not know the exact mapping between the schemas.
  • Do we need schema mapping? How can we deal with the uncertainty of schema mapping?

INEX: Benchmark for text-based XML Retrieval

INEX: INitiative for the Evaluation of XML Retrieval

  • The analog of TREC (Text REtrieval Conference) for standard unstructured information retrieval
  • Provides a testbed consisting of a set of XML documents, plain queries (content-only queries) and structured queries (with XML structure), and a set of retrieval tasks
  • INEX 2002-2006: mainly organized by people from Europe. It has attracted many participants from universities and big companies all over the world

INEX: Benchmark for text-based XML Retrieval

Ad-hoc XML Retrieval Task

  • Each system indexes a set of XML documents
  • For a set of queries (content-only, content and structure), the system converts the queries into its internal representation
  • In response, each system returns not documents, but the most relevant elements within documents

Evaluation metrics: the retrieved elements are evaluated on two measures

  • Relevance: how relevant is the retrieved element
  • Coverage: is the retrieved element too specific, too general, or just right
  • There are scales for both measures, which are then turned into precision/recall measures

INEX: Benchmark for text-based XML Retrieval

Ad-hoc XML Retrieval Task

  • 12,107 articles from IEEE Computer Society publications
  • 494 Megabytes
  • Average article: 1,532 XML nodes/elements
  • Average node/element depth = 7

SLIDE 10

INEX: Benchmark for text-based XML Retrieval

Relevance:

  • Relevance is assessed on a scale from Irrelevant (scoring 0) to Highly Relevant (scoring 3)

Coverage:

  • No Coverage (N), too general (L), too specific (S), Exact (E)

So every element returned by each engine has ratings from {0,1,2,3} × {N,S,L,E}

INEX: Benchmark for text-based XML Retrieval

Define scores:

\[
f_{\text{strict}}(rel, cov) =
\begin{cases}
1 & \text{if } rel,cov = 3E \\
0 & \text{otherwise}
\end{cases}
\]

\[
f_{\text{generalized}}(rel, cov) =
\begin{cases}
1.00 & \text{if } rel,cov = 3E \\
0.75 & \text{if } rel,cov \in \{2E, 3L, 3S\} \\
0.50 & \text{if } rel,cov \in \{1E, 2L, 2S\} \\
0.25 & \text{if } rel,cov \in \{1S, 1L\} \\
0.00 & \text{if } rel,cov = 0N
\end{cases}
\]
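The two quantization functions translate directly into lookups. In this sketch a rating is a `(rel, cov)` pair such as `(3, "E")`; the table name `GENERALIZED` is illustrative, and returning 0.0 for combinations not listed in the definition is an assumption.

```python
def f_strict(rel, cov):
    """1 only for highly relevant elements with exact coverage (3E)."""
    return 1.0 if (rel, cov) == (3, "E") else 0.0

# Generalized quantization, one entry per listed (relevance, coverage) pair.
GENERALIZED = {
    (3, "E"): 1.00,
    (2, "E"): 0.75, (3, "L"): 0.75, (3, "S"): 0.75,
    (1, "E"): 0.50, (2, "L"): 0.50, (2, "S"): 0.50,
    (1, "S"): 0.25, (1, "L"): 0.25,
    (0, "N"): 0.00,
}

def f_generalized(rel, cov):
    """Graded score; unlisted combinations default to 0.0 (an assumption)."""
    return GENERALIZED.get((rel, cov), 0.0)

ratings = [(3, "E"), (2, "L"), (1, "S"), (0, "N")]
print([f_strict(r, c) for r, c in ratings])
print([f_generalized(r, c) for r, c in ratings])
```

The strict variant rewards only perfect answers, while the generalized variant gives partial credit to elements that are relevant but too large or too small.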

INEX: Benchmark for text-based XML Retrieval

Heterogeneous XML retrieval task:

  • The ad-hoc track in INEX has dealt with a single DTD for one type of document (computer science journal articles)
  • In "real-world" environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content

Problems:

  • What methods can be used to map structural criteria onto other DTDs?
  • Should mappings focus on element names, or also deal with element content or semantics?

XML Information Retrieval: Outline

Semi-Structured Data

  • XML, Examples, Applications

XML Search

  • XQuery
  • XIRQL

Text-Based XML Retrieval

  • Vector-space model
  • INEX

XML Resources

  • www.w3.org/XML - XML resources at W3C
  • Jan-Marco Bremer's publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
  • Norbert Fuhr and Kai Grossjohann. XIRQL, SIGIR 2001
  • INEX: http://inex.is.informatik.uni-duisburg.de/
  • Chris Manning: Introduction to Information Retrieval

Some contents of the slides are based on the above materials.