SLIDE 1

MMDB-3 J. Teuhola 2012 39

3. Text and document databases

Normal databases: formatted records;

document databases: free-form or semi-structured data (e.g. XML).

Application areas:

Office automation, document archives
Digital libraries
Electronic dictionaries / encyclopedias
Electronic newspapers
Source program libraries
Automated law and patent offices

What is a ‘document’? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc.

General problem setting: Searching documents by contents; often also called associative search.

Usually based on keywords or terms occurring in documents. Search terms may be combined with Boolean connectives (AND, OR, NOT).

SLIDE 2

Non-indexed methods for string matching

Sequential full-text scanning; slow but some advantages:

No extra disk space required
Updates are fast (no index maintenance)
Partial-match retrieval is rather simple (using wildcard characters)
Approximate matching is also possible (using a threshold for edit distance)

Popular efficient algorithm: Boyer-Moore technique

Peculiar feature: searching is faster for longer search strings.
Based on preprocessing of the search string
Performance is sublinear in practice
Disk accesses cannot be reduced, except for extremely long search strings.

Several other string matching algorithms exist (skipped).
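
To make the preprocessing idea concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore (not the full Boyer-Moore algorithm named on the slide). The function name `horspool_search` is an illustrative assumption. The shift table is built from the search string alone, and longer patterns allow longer jumps.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocessing of the search string: for every pattern character except
    # the last, the distance from its rightmost occurrence to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        # Simplification: compare the whole window with a slice comparison
        # (the classical algorithm compares character by character from the right).
        if text[pos:pos + m] == pattern:
            return pos
        # Shift according to the text character under the window's last position;
        # characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("TO BE OR NOT TO BE", "NOT"))
```

With the pattern "NOT" the window moves three characters at a time over this text, so most text characters are never inspected.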

SLIDE 3

Inverted indexing

Traditional way of improving search speed. What does ‘inverted’ mean?

A document is a list of words, but the index gives for each word the list of documents where the word appears.

Example documents:

D1: "TO BE OR NOT TO BE"
D2: "TO BE IS TO DO"
D3: "DO BE DO BE DO"

Inverted index:

BE   {D1, D2, D3}
DO   {D2, D3}
IS   {D2}
NOT  {D1}
OR   {D1}
TO   {D1, D2}
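
The inversion can be sketched in a few lines of Python. This is only an illustration for the three example documents; the function name `build_inverted_index` and the whitespace tokenization are assumptions, not part of the slides.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():  # assumption: words are separated by spaces
            index[word].add(doc_id)
    return dict(index)


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```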

SLIDE 4

Inverted indexing (cont.)

The set of words is called a lexicon. Some principles for it:

Case folding: Convert uppercase letters to lowercase.
Stemming: Remove suffixes; index only the root forms of terms.
Do not include stopwords, like “the”, “is”, “as”, “that”, etc. which occur very often but do not bear semantic relevance.

The pointers to term occurrences may appear in different granularities:

Coarse-grained index identifies document groups where the term appears.
Moderate-grained index identifies the relevant documents.
Fine-grained index contains sentence, word, or even byte numbers for term occurrences.
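
The three lexicon principles can be illustrated with a rough sketch. The stopword set holds only the slide's four examples, and the suffix list in `naive_stem` is a hypothetical stand-in that is far cruder than a real stemming algorithm.

```python
STOPWORDS = {"the", "is", "as", "that"}   # the slide's examples; real lists are longer
SUFFIXES = ("ing", "ed", "es", "s")       # hypothetical, for illustration only


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    words = [w.lower() for w in text.split()]                      # case folding
    return [naive_stem(w) for w in words if w not in STOPWORDS]    # stopwords, stemming


print(index_terms("The index is Searching documents"))
```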

SLIDE 5

Inverted indexing (cont.)

Coarse-grained index:

Small index size
Small maintenance penalty
A lot of plain-text scanning
False drops for multi-term queries (the terms occur in the same document group but do not co-occur in one document).

Fine-grained index:

Large index size
High maintenance penalty
Supports proximity queries (terms occurring together)

SLIDE 6

Inverted indexing (cont.)

Ways to save storage space:

Front compression: Index in alphabetic order; the prefix common with the previous term is expressed compactly.

Tail (suffix) compression: Store terms to the point where they can be uniquely distinguished from other terms.

In fine-grained index: Instead of full pointers, store the intervals (differences) between successive occurrences of a term.

Compound queries:

AND: Retrieve pointer lists and compute their intersection.
OR: Retrieve pointer lists and form their union.
NOT: This is usually combined with AND, so that we can apply set difference to the pointer lists of terms.
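
As a small illustration, the compound queries map directly onto set operations on the pointer lists of the earlier example index. The helper `to_gaps` is a hypothetical name showing the interval idea for a fine-grained pointer list.

```python
# Pointer lists of the example index (moderate-grained: document ids).
index = {
    "BE": {"D1", "D2", "D3"}, "DO": {"D2", "D3"}, "IS": {"D2"},
    "NOT": {"D1"}, "OR": {"D1"}, "TO": {"D1", "D2"},
}

to_and_do = index["TO"] & index["DO"]   # AND = intersection
to_or_do = index["TO"] | index["DO"]    # OR  = union
be_not_to = index["BE"] - index["TO"]   # BE AND NOT TO = set difference


def to_gaps(positions: list[int]) -> list[int]:
    """Replace sorted absolute positions by differences to the previous one."""
    return [p - q for p, q in zip(positions, [0] + positions[:-1])]


print(sorted(to_and_do), sorted(to_or_do), sorted(be_not_to))
print(to_gaps([3, 7, 12]))
```

The gaps are smaller numbers than the absolute positions, so they can be stored with fewer bits.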

SLIDE 7

Data structures for inverted indexes

(a) B+-tree, with index terms as keys and pointer lists as leaves.

(b) Hash organization, preferably a dynamic version, e.g.

Linear hashing
Extendible hashing

(c) Trie structure: each node represents one character, and the term is found by following a path from the root to a leaf. Problem: must be mapped to external storage.

In each case, the variable-length pointer lists can be stored either locally in the structure, or (preferably) detached.

SLIDE 8

Algorithm for building an inverted index

1. For each document, gather the index terms, combined with pointers to the actual locations. The result is a big sequential file (called S).
2. Sort S by using e.g. external mergesort.
3. Combine the successive entries representing the same index term.
4. Build the index (B+-tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records.

Assessment of inverted indexes:

A very effective access method for retrieval.
A very popular technique in practice for static document sets (in spite of the storage penalty).
Presumably the main retrieval tool in web search engines.
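
The four construction steps can be sketched in memory as follows. This is a toy version under clear assumptions: an in-memory sort stands in for the external mergesort, a Python dict stands in for the B+-tree or hash table, and word numbers serve as fine-grained pointers.

```python
from itertools import groupby


def build_index(docs: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    # Step 1: gather (term, document id, word number) entries into the sequence S.
    S = [(term, doc_id, pos)
         for doc_id, text in docs.items()
         for pos, term in enumerate(text.split())]
    # Step 2: sort S (a real system would use external mergesort on disk).
    S.sort()
    # Step 3: combine successive entries of the same term into one pointer list.
    # Step 4: a dict stands in for the index built on the terms.
    return {term: [(doc_id, pos) for _, doc_id, pos in group]
            for term, group in groupby(S, key=lambda entry: entry[0])}


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
fine_index = build_index(docs)
print(fine_index["DO"])
```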

SLIDE 9

Bitmap indexing

Another representation for the inverted index
Suitable for coarse and moderate-grained indexing
A bitmap is a matrix with a row for each index term, and a column for each document. Element <i, j> is 1, if term i occurs in document j, otherwise 0.

Example documents:

d1. "TO BE OR NOT TO BE"
d2. "TO BE IS TO DO"
d3. "DO BE DO BE DO"

       d1  d2  d3
BE      1   1   1
DO      0   1   1
IS      0   1   0
NOT     1   0   0
OR      1   0   0
TO      1   1   0

[Note: Normally these index terms would be stopwords.]
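
A minimal sketch of the example bitmap, with each row held in one Python integer so that Boolean queries become bitwise operations. The helper name `docs_of` and the convention that the leftmost bit stands for d1 are assumptions made for this illustration.

```python
DOCS = ["d1", "d2", "d3"]

# One row per index term; the leftmost of the three bits corresponds to d1.
bitmap = {"BE": 0b111, "DO": 0b011, "IS": 0b010,
          "NOT": 0b100, "OR": 0b100, "TO": 0b110}
ALL = (1 << len(DOCS)) - 1   # mask needed after bitwise NOT


def docs_of(row: int) -> list[str]:
    """List the documents whose bit is set in the given row."""
    return [d for i, d in enumerate(DOCS) if (row >> (len(DOCS) - 1 - i)) & 1]


print(docs_of(bitmap["TO"] & bitmap["DO"]))          # TO AND DO
print(docs_of(bitmap["TO"] | bitmap["DO"]))          # TO OR DO
print(docs_of(bitmap["BE"] & ~bitmap["TO"] & ALL))   # BE AND NOT TO
```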

SLIDE 10

Bitmap indexing (cont.)

Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel).

Problem: High storage consumption (#terms × #documents); the matrix is usually sparse.

Possible combined structure: Use an inverted index for the less frequent terms, and a bitmap for the more frequent terms.

One option: compression. E.g. run-length coding: Replace sequences of zeroes by their count (which gets close to the normal inverted index). Also the encoding of integers has to be decided.

Example:

Bitmap row = 001010000010001100000100...
Run-length code = <2, 1, 5, 3, 0, 5, ...>
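
The run-length example can be reproduced with a short sketch. The function names are illustrative, and the choice to drop the zeroes after the last 1-bit is an assumption of this sketch (a real format would also record the row length).

```python
def zero_run_lengths(bits: str) -> list[int]:
    """Encode a bit string as the lengths of the zero runs preceding each 1-bit."""
    runs, count = [], 0
    for b in bits:
        if b == "0":
            count += 1
        else:
            runs.append(count)
            count = 0
    return runs   # zeroes after the last 1-bit are not emitted in this sketch


def decode_runs(runs: list[int]) -> str:
    """Rebuild the bit string up to and including the last 1-bit."""
    return "".join("0" * r + "1" for r in runs)


row = "001010000010001100000100"
print(zero_run_lengths(row))
```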

SLIDE 11

Hierarchical compression of bit strings

Divide the string into equal-sized blocks.
Apply disjunction (OR) to the bits within each block, creating a higher-level bit: A block of zeroes generates a 0-bit, others a 1-bit.
The process is repeated on higher levels, recursively.
Advantage: Single bits are easily accessible, by studying one path in the tree.
Compression: zero blocks need not be stored at all.
Most of the leaf blocks are usually zero blocks (sparse bitmap).

Example (block size 4, root level first):

Level 2:  1100
Level 1:  0101 1010 0000 0000
Level 0:  0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000
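
The level construction for the example above can be sketched as follows, assuming a block size of 4 and a bit-string length that is a power of the block size. The function names `build_levels` and `stored_blocks` are hypothetical.

```python
def build_levels(bits: str, block: int = 4) -> list[str]:
    """Return all levels, leaves first; each higher bit is the OR of one block."""
    levels = [bits]
    while len(levels[-1]) > block:
        cur = levels[-1]
        levels.append("".join(
            "1" if "1" in cur[i:i + block] else "0"
            for i in range(0, len(cur), block)))
    return levels


def stored_blocks(levels: list[str], block: int = 4) -> list[list[str]]:
    """Only the nonzero blocks of each level need to be stored."""
    return [[lvl[i:i + block] for i in range(0, len(lvl), block)
             if "1" in lvl[i:i + block]]
            for lvl in levels]


leaves = ("0000 0010 0000 0011 1000 0000 0100 0000" + " 0000" * 8).replace(" ", "")
levels = build_levels(leaves)
kept = stored_blocks(levels)
print(levels[1], levels[2])
print(sum(len(blocks) for blocks in kept), "blocks stored")
```

For this example 7 blocks of 4 bits are kept instead of the original 64 bits.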

SLIDE 12

Signature indexing

A probabilistic technique

Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features).

Signature is a bit-array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long.

Signatures are collected into a separate signature file, which is smaller than the whole space required by the documents.

The signature file acts as a filtering mechanism, to reduce the amount of actual data to be searched.

The structure enables partial-match retrieval (subset of terms match).

Queries can also be considered a kind of documents (collection of keywords), and be converted into signature form. The query signature Q is compared against document signatures Di. If the 1-bits of Q are included in the 1-bits of Di, then Di is a candidate result.

Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates.

SLIDE 13

Signature indexing (cont.)

Advantages of signature files:

Low storage consumption, compared to (fine-grained) inverted indices. Typical value is 10-20% of the primary database size.
More convenient than multidimensional indexes (to be studied later)

Signature generation methods:

Word signature method: Each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in higher false drop probability, but preserves sequencing information of terms in documents.

Superimposed coding: Each index term produces a bit pattern of full signature size. The patterns are OR’ed to form the document signature. This is here the default method.

How to minimize false drops?

The proportions of 0’s and 1’s should be equal and uniformly distributed. (Weight = number of 1-bits in the signature)

SLIDE 14

Signature indexing: Example

Document index terms:

D1: database, object, programming, schema
D2: algorithm, computer, programming
D3: algorithm, data structure, programming

Hashing:

hash(“algorithm”) = 3
hash(“computer”) = 1
hash(“database”) = 7
hash(“data structure”) = 5
hash(“object”) = 6
hash(“programming”) = 4
hash(“schema”) = 1

Signatures: D1 = 10010110, D2 = 10110000, D3 = 00111000

Query: Documents about “algorithm” and “programming”?

Query signature: Q = 00110000, matches with D2 and D3.
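
The example can be replayed with a short sketch of superimposed coding. The hash values are those given on the slide (hard-coded in a table, not computed); the function names and the convention that position 1 is the leftmost bit are assumptions.

```python
# Hash values copied from the example; each term sets one bit of 8.
HASH = {"algorithm": 3, "computer": 1, "database": 7, "data structure": 5,
        "object": 6, "programming": 4, "schema": 1}
WIDTH = 8

DOC_TERMS = {
    "D1": ["database", "object", "programming", "schema"],
    "D2": ["algorithm", "computer", "programming"],
    "D3": ["algorithm", "data structure", "programming"],
}


def signature(terms: list[str]) -> int:
    sig = 0
    for t in terms:
        sig |= 1 << (WIDTH - HASH[t])   # position 1 is the leftmost bit
    return sig


def candidates(query_terms: list[str]) -> list[str]:
    """Documents whose signature contains all 1-bits of the query signature."""
    q = signature(query_terms)
    return [d for d, terms in DOC_TERMS.items() if (q & signature(terms)) == q]


def verified(query_terms: list[str]) -> list[str]:
    """Eliminate false drops by checking the actual terms of each candidate."""
    return [d for d in candidates(query_terms)
            if all(t in DOC_TERMS[d] for t in query_terms)]


print({d: format(signature(t), "08b") for d, t in DOC_TERMS.items()})
print(candidates(["algorithm", "programming"]))
print(candidates(["computer"]), verified(["computer"]))
```

The last line shows a false drop: "computer" and "schema" hash to the same position, so D1 is a candidate for the query "computer" even though it does not contain the term.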

SLIDE 15

Models for information retrieval from documents

Boolean model

Queries consist of search terms, connected by Boolean operators (AND, OR, NOT).

All documents containing a match are retrieved; there is no ranking.

Vector space model

Search terms are given weights in documents.
Documents are ranked based on the distance between the query vector and the document (term-weight) vectors.

Probabilistic model

Estimation of the probability of relevance between query and document, based on the relevance probabilities of the terms.

Popular version of this approach: BM25 (Best Match 25).

SLIDE 16

Vector-based document retrieval

Principles:

Retrieval with less precise queries; not always exact-match.
Ranking the documents according to their ‘distance’ from the query.
Semantic correlations of terms should be taken into account.

Concepts:

Synonymy: Different terms mean the same thing.
Polysemy: A single term has multiple meanings.
Weight of a term indicates its importance in a document. The weight could be the number of term occurrences in the document.

Measuring the goodness of retrieval:

Precision: Proportion of retrieved relevant documents, relative to the total number of retrieved documents.
Recall: Proportion of retrieved relevant documents, relative to the total number of relevant documents.

The user decides which documents are relevant and which are not.

SLIDE 17

Precision and recall

[Figure: Venn diagram inside the set of all documents; A = retrieved, B = relevant, X = hits (the overlap of A and B)]

Precision = |X| / |A|
Recall = |X| / |B|
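
Both measures follow directly from set sizes, as this small sketch shows. The document ids are made up for illustration, and both sets are assumed to be non-empty.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall); assumes both sets are non-empty."""
    hits = retrieved & relevant   # X = retrieved documents that are relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)


# Hypothetical example: 4 documents retrieved, 8 relevant, 2 in common.
precision, recall = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8, 9, 10})
print(precision, recall)
```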

SLIDE 18

Matrix representation of the term & document sets

Matrix D: one row per term, one column (Di) per document
Generalization of bitmap
Element Dt,i denotes the weight of term t in document i.
The weight could be e.g. the number of term occurrences (more advanced scheme later).
Column vectors characterize documents.

Example documents:

D1: "TO BE OR NOT TO BE"
D2: "TO BE IS TO DO"
D3: "DO BE DO BE DO"

Matrix D:

       D1  D2  D3
BE      2   1   2
DO      0   1   3
IS      0   1   0
NOT     1   0   0
OR      1   0   0
TO      2   2   0

SLIDE 19

Comparing queries and documents

Queries can be regarded as documents, as well, with their own characterizing vector, built from query terms (but usually without weights).

Task: Find k documents, whose vectors are closest to the query vector.

Problem: How to measure the distance (or similarity) between documents (or usually query & document)?

First attempt: $\mathrm{Similarity}(Q, D_i) = Q \cdot D_i = \sum_{t \in \mathrm{Terms}} Q_t D_{t,i}$, i.e. the inner product of query and document weight vectors.

Example: Query = {TO, DO}, i.e. Q = (0, 1, 0, 0, 0, 1) in term order (BE, DO, IS, NOT, OR, TO)

Q · D1 = (0, 1, 0, 0, 0, 1) · (2, 0, 0, 1, 1, 2) = 2
Q · D2 = (0, 1, 0, 0, 0, 1) · (1, 1, 1, 0, 0, 2) = 3 (best similarity)
Q · D3 = (0, 1, 0, 0, 0, 1) · (2, 3, 0, 0, 0, 0) = 3 (best similarity)
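
The inner-product scores of the example can be checked with a few lines. The names `query_vector` and `inner_product` are illustrative; the document vectors are the columns of the example matrix D.

```python
TERMS = ["BE", "DO", "IS", "NOT", "OR", "TO"]

# Column vectors of the example term-document matrix (term counts).
D = {
    "D1": (2, 0, 0, 1, 1, 2),
    "D2": (1, 1, 1, 0, 0, 2),
    "D3": (2, 3, 0, 0, 0, 0),
}


def query_vector(query_terms: set[str]) -> tuple[int, ...]:
    """Unweighted query vector: 1 for each query term, 0 otherwise."""
    return tuple(1 if t in query_terms else 0 for t in TERMS)


def inner_product(q, d) -> int:
    return sum(qt * dt for qt, dt in zip(q, d))


Q = query_vector({"TO", "DO"})
scores = {doc: inner_product(Q, vec) for doc, vec in D.items()}
print(Q, scores)
```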

SLIDE 20

Giving weights to terms

Problems:

If frequencies $f_{t,i}$ (term instances per document) are used as weights, general terms are overweighted.
Long documents are favored over short ones, because they contain more terms.

Applying Zipf’s law:

The weight of an index term t should be inversely proportional to its frequency among documents, i.e. $w_t = 1/n_t$, where $n_t$ = number of documents containing term t.

An attenuated (weakened) form for this weight: $w_t = \log(1 + N/n_t)$, where N = number of documents.

This can be combined with the term frequency, giving $D_{t,i} = f_{t,i} \cdot \log(1 + N/n_t)$, which is called the TF*IDF rule (“term frequency times inverse document frequency”).
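
A sketch of the TF*IDF rule applied to the earlier three-document example. The base of the logarithm is not stated in the source; the natural logarithm used here is an assumption, as are the function names.

```python
import math

# Term frequencies f[t][i] for documents D1, D2, D3 (from the example matrix).
FREQ = {"BE": (2, 1, 2), "DO": (0, 1, 3), "IS": (0, 1, 0),
        "NOT": (1, 0, 0), "OR": (1, 0, 0), "TO": (2, 2, 0)}
N = 3   # number of documents


def idf(term: str) -> float:
    n_t = sum(1 for f in FREQ[term] if f > 0)   # documents containing the term
    return math.log(1 + N / n_t)                # natural log is an assumption


def tf_idf(term: str, doc: int) -> float:
    return FREQ[term][doc] * idf(term)


weights = {t: [round(tf_idf(t, i), 3) for i in range(N)] for t in FREQ}
for term, row in weights.items():
    print(term, row)
```

BE occurs in every document and therefore gets the smallest IDF factor, while terms found in a single document (IS, NOT, OR) get the largest.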

SLIDE 21

Measuring distance / similarity

Geometric distances in vector space:

Euclidean distance (similarity = inverse of distance):

$$d(Q, D_i) = \sqrt{\sum_{t \in \mathrm{Terms}} (Q_t - D_{t,i})^2}$$

This measure discriminates against long documents: $D_{t,i}$ is large, $Q_t$ small.

Cosine rule: The angle (cosine) between query and document vectors in space is a good measure of their distance (similarity).

From vector algebra:

$$Q \cdot D_i = |Q|\,|D_i| \cos\theta, \quad \text{where } |Q| = \sqrt{\sum_{t \in \mathrm{Terms}} Q_t^2}, \text{ resp. } |D_i|$$

The obtained similarity measure:

$$\cos\theta = \frac{Q \cdot D_i}{|Q|\,|D_i|}$$

Combined with TF*IDF:

$$\cos(Q, D_i) = \frac{\sum_{t \in Q} f_{t,i}\,\bigl(\log(1 + N/n_t)\bigr)^2}{|Q|\,|D_i|}$$
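
The cosine rule can be tried on the earlier example with plain term counts as weights (not TF*IDF, to keep the sketch short). The function name `cosine` is illustrative.

```python
import math


def cosine(q, d) -> float:
    """Cosine of the angle between two weight vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


# Term order: BE, DO, IS, NOT, OR, TO; query = {TO, DO}.
Q = (0, 1, 0, 0, 0, 1)
D = {
    "D1": (2, 0, 0, 1, 1, 2),
    "D2": (1, 1, 1, 0, 0, 2),
    "D3": (2, 3, 0, 0, 0, 0),
}
ranking = sorted(D, key=lambda doc: cosine(Q, D[doc]), reverse=True)
print(ranking, [round(cosine(Q, D[doc]), 3) for doc in ranking])
```

The plain inner product scored D2 and D3 equally (3 each); the length normalization of the cosine separates them and ranks D2 first.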

SLIDE 22

Search engine architecture

Acquisition of web documents: By crawling using the links, and monitoring RSS feeds.

The documents are stored, preprocessed to text, and indexed.

Preprocessing tasks: parsing, stopword removal, stemming, link extraction, classification, etc.

Indexing: collecting statistics, giving weight to terms, and building the index; inverted indexes are the common index type.

Query processing: preprocessing, retrieval, ranking, output, relevance feedback.

Main goals: relevance of query results, and efficient processing.

Problem of scale: the amounts of documents and queries are huge. Distribution, parallelism, replication, compression, etc. are needed.

SLIDE 23

Ranking in Web search engines

Search steps:

Normal search by keywords resulting in ’hits’
Ranking of the hit documents

Main difference between web pages and traditional sets of documents: (hyper)links. Links are the most important factor in ranking documents.

Other ranking factors:

Content relevance measure
Number of visits
Estimated (formal) quality of the content
Page loading time
Financial promotion

SLIDE 24

HITS algorithm for ranking

Retrieved pages are given two scores:

Authority score ($a_i$) represents how respected the page is, knowing the incoming links, and the scores of the referring pages.

Hub score ($h_i$) measures the goodness of the page, in view of the scores of the pages to which it refers.

The scores are inter-related; one cannot be decided before the other!

Solution: Start with initial scores $a_i^{(0)}$ and $h_i^{(0)}$, and iterate:

$$a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)} \quad \text{and} \quad h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k-1)} \quad \text{for } k = 1, 2, \ldots \text{ until scores stabilize}$$

where $j \to i$ means that page j links to page i.

Normalization of score vectors must be done at each iteration.

The process actually computes the dominant eigenvectors of $M^T M$ and $M M^T$, where M is the adjacency matrix of the page-link graph.

Problem with HITS algorithm: Scores are computed during query processing, not precomputed.
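
The HITS iteration can be sketched on a tiny graph. The link structure below is invented for illustration, and a fixed number of iterations stands in for the "until scores stabilize" test; the function names are assumptions.

```python
import math


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale the score vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links: dict[str, list[str]], iterations: int = 50):
    """links[p] lists the pages that p refers to. Returns (authority, hub)."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the previous hub scores of the referring pages.
        new_auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                    for p in pages}
        # Hub: sum of the previous authority scores of the referred pages.
        new_hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical link graph: A -> C, A -> D, B -> C.
auth, hub = hits({"A": ["C", "D"], "B": ["C"]})
print({p: round(v, 3) for p, v in auth.items()})
print({p: round(v, 3) for p, v in hub.items()})
```

Page C has two incoming links and ends up as the best authority; page A refers to both authorities and ends up as the best hub.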

SLIDE 25

PageRank algorithm

The famous basis of Google’s ranking method, still essential!

Rank (importance) of a webpage is determined by the importance of the pages referring to it.

Network flow idea: if a page has importance r and k outgoing links, importance ’flows’ equally to the referred pages, namely r/k for each.

Iterative computation:

$$r^{(k)}(P) = \sum_{L} \frac{r^{(k-1)}(L)}{f(L)}$$

where

L goes through the pages linking to P.
f(L) denotes the fanout (#out-pointers) of L.

In matrix algebra: convergence to the left eigenvector of a transition matrix with entry $1/f(P_i)$ on row i at the columns of the pages that $P_i$ links to.

Difference from HITS: PageRank values are independent of queries.
Problem with both: semantic correlations are not well considered.
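
A minimal sketch of the iteration exactly as given above, without a damping factor or any handling of pages that have no out-links (both are assumptions of this sketch: every page must have at least one out-link and appear as a key). The example graph is invented.

```python
def pagerank(links: dict[str, list[str]], iterations: int = 100) -> dict[str, float]:
    """links[p] lists the pages that p links to; every page must be a key."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # uniform initial importance
    for _ in range(iterations):
        # Each page L passes rank[L] / fanout(L) to every page it links to.
        rank = {p: sum(rank[l] / len(links[l]) for l in pages if p in links[l])
                for p in pages}
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
rank = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({p: round(r, 3) for p, r in rank.items()})
```

For this graph the values settle at 0.4 for A, 0.2 for B and 0.4 for C: B receives only half of A's importance, while C receives the other half plus all of B's.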

SLIDE 26

Latent semantic indexing (LSI)

Problem with the plain term-document approach:

Terms have synonyms, and documents often resemble each other.
The term-document matrix is usually large and sparse (mostly 0).

Latent semantic indexing (LSI):

Reduction of the search space (and the matrix)
Representation of terms and documents in ‘semantic space’, by deriving a set of uncorrelated factors (‘concepts’)

Steps in LSI:

1. Create the weighted term-document matrix D.
2. Compute a singular value decomposition (A, S, B) of D by splitting D into three matrices A, S, and B.
3. Reduce the size of matrices by eliminating insignificant rows/columns.
4. Store the matrices, using any of the available indexing techniques.

SLIDE 27

LSI: Singular value decomposition (SVD)

Given any matrix D of size m × n, it is possible to find matrices A, S, and B such that

$$D = A \times S \times B^T$$

A is an orthogonal m × m matrix, i.e. $A^T A = I$
B is an orthogonal n × n matrix, i.e. $B^T B = I$
S is a diagonal m × n matrix (called singular matrix) where nonzero elements are on the diagonal from top-left in non-increasing order

[Figure: D (m × n) = A (m × m) × S (m × n) × B^T (n × n)]

SLIDE 28

LSI: Singular value decomposition (SVD)

Idea:

Reduce r (the number of nonzero diagonal elements of S, i.e. the rank of D) to a smaller value k, such that the least significant (bottom-right) elements of S are discarded, as well as the corresponding columns in A and rows in $B^T$.

As a result we need to store only the reduced matrices $A_k$ (m × k), $S_k$ (k × k), and $B_k^T$ (k × n).

$$D \approx A_k \times S_k \times B_k^T$$

SLIDE 29

LSI usage

Query processing:

Compute the transformed query vector $Q_k = Q^T A_k S_k^{-1}$ and apply the vector similarity search to $Q_k$ using matrix $B_k$.

Update: Complicated; LSI can be recommended mainly for semi-static document collections.

Strength: LSI is able to identify concepts or patterns among terms, based on their co-occurrence in the documents. Semantic correlations are extracted from ‘noise’.

Note: LSI resembles Principal Component Analysis (PCA), a well-known dimensionality-reduction method based on eigenvectors.

SLIDE 30

Semi-structured documents: XML

XML = eXtensible Markup Language
Accepted 1998 by W3C (World Wide Web Consortium)
Simplified form of SGML (Standard Generalized Markup Language)
Document = tree structure (hierarchy) of elements
Element enclosed by start and end tags
Elements can be references to media objects
Elements can be further described by attributes (in the start tag)

Differences from HTML:

Logical content separated from physical layout
Extensible: new tags can be adopted according to need
XML has also other purposes than web publishing

SLIDE 31

Example XML document

<?xml version="1.0"?>
<books>
  <book isbn="123-456-789">
    <title>Database systems</title>
    <authors>
      <author>Elmasri</author>
      <author>Navathe</author>
    </authors>
  </book>
  <book isbn="987-654-321">
    <title>Multimedia databases</title>
    <authors>
      <author>Dunckley</author>
    </authors>
  </book>
</books>

Element hierarchy (figure): books contains book*; each book contains title and authors; authors contains author+.
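
To show the tree structure in use, the example document can be read with Python's standard `xml.etree.ElementTree` module. This is only an illustration of navigating the element hierarchy; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<books>
  <book isbn="123-456-789">
    <title>Database systems</title>
    <authors>
      <author>Elmasri</author>
      <author>Navathe</author>
    </authors>
  </book>
  <book isbn="987-654-321">
    <title>Multimedia databases</title>
    <authors>
      <author>Dunckley</author>
    </authors>
  </book>
</books>"""

root = ET.fromstring(XML)   # root element: <books>

# Map each ISBN attribute to (title, list of author names).
catalog = {
    book.get("isbn"): (book.findtext("title"),
                       [a.text for a in book.iter("author")])
    for book in root.findall("book")
}
for isbn, (title, authors) in catalog.items():
    print(isbn, title, authors)
```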

SLIDE 32

XML-based markup languages

Scalable Vector Graphics (SVG):

Presentation of variable-size vector graphics on screen.

Office Open XML (OOXML; OpenXML):

File format for representing office documents like (rich) text, spreadsheets, slide presentations, etc.

Web services:

Applications that can communicate with other applications using standard protocols (http) over the Internet.

Mathematical Markup Language (MathML):

Presentation of mathematical formulas.

Chemical Markup Language (CML):

Presentation of chemical formulas.

and many others ...

SLIDE 33

XML-based markup languages (cont.)

Synchronized Multimedia Integration Language (SMIL):

Controls layout, interaction, operation and timing of multimedia presentations
Gathers the media files in the order that they should appear
Combines them into a single stream
Viewing by SMIL-enabled player (e.g. Ambulant 2.0)
W3C recommendation, see http://www.w3.org/AudioVideo/
Latest official SMIL 2.1, latest proposal SMIL 3.0 (Dec. 2008)
Several media players support it (e.g. RealPlayer; IE partially)
Tutorial: http://www.w3schools.com/smil/default.asp

SLIDE 34

SMIL code example

<smil>
  <head>
    <layout>
      <root-layout height="250" width="300" background-color="#ffffff" title="Z"/>
      <region id="title" width="200" height="100" top="0" left="0" z-index="1" />
      <region id="img" width="200" height="150" top="50" left="40" z-index="2" />
    </layout>
  </head>
  <body>
    <seq>
      <text src="http://www.xxx/head.txt" region="title" begin="2.00s" end="3.00s" />
      <par>
        <text src="http://www.yyy/text.txt" region="title" />
        <img src="http://www.ttt/fig.gif" region="img" begin="1.00s" end="10.00s" />
        <audio src="http://www.zzz/music.rm" begin="5.00s" end="10.00s" />
      </par>
    </seq>
  </body>
</smil>

SLIDE 35

XML and databases

Some storage alternatives:

1. Normalized database + transformation of query results into XML. Advantage: Uniform presentation of database objects in heterogeneous, distributed and multi-tier database systems.

2. Storage of the XML code as an attribute value in the document table (together with document id and other separated search attributes). The XML attribute is logically unnormalized. Advantage: No transformation needed for viewing.

3. Storage of XML document elements and parent links as attributes in a relation. This needs careful indexing.

4. Native XML database: Requires a query language (interface) and indexing support.

SLIDE 36

XQuery: XML query language

Developed by W3C (WWW Consortium), see http://www.w3.org/TR/xquery/
XQuery 1.0, latest version January 2007.
Important constituent: Path expressions (using XPath)

Control:

looping (FOR)
variable binding (LET)
selection condition (WHERE)
creating result (RETURN)

Arithmetic and logical operators
Sorting; sequence processing
XQuery syntax alternatives: SQL- or XML-oriented

SLIDE 37

XML query examples

XPath: Find titles of articles with type "draft":

collection('articles')/article[@type="draft"]/title

XQuery: Find authors of articles written in 2005 (join of two document collections):

for $art in collection('articles')/article[@year="2005"]
let $author := collection('authors')/author[@id=$art/auth_id]
return
  <result>
    <title> { $art/title } </title>
    <author> { $author/name } </author>
  </result>

SLIDE 38

XML support in database management systems

Commercial DBMSs extended by XML and XQuery support:

IBM DB2 9 ‘Viper’
Oracle 11g XML DB
Microsoft SQL Server 2005

Some ’native’ XML databases:

dbXML (open-source)
eXist (open-source)
xDB (commercial)