Unstructured Data Management

Advanced Topics in Database Management (INFSCI 2711)

Vladimir Zadorozhny, DINS, SCI, University of Pittsburgh

Textbooks:
- Database System Concepts (2010)
- Introduction to Information Retrieval (2008)

Unstructured Data

- Typically refers to free text
- Allows:
  - Keyword queries, including operators
  - More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
- Classic model for searching text documents


Unstructured Data and Query Example

- Antony and Cleopatra, Act III, Scene ii

  Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

- Hamlet, Act III, Scene ii

  Lord Polonius: I did enact Julius Caesar: I was killed i' the
  Capitol; Brutus killed me.

Query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?


Database Management vs Information Retrieval

- Data:
  - DB: set of tables with a well-defined schema
  - IR: set of (text) documents
- Goal:
  - DB: find an accurate response to a user query
  - IR: retrieve documents with information that is relevant to the user's information need


Querying Unstructured Data

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But:
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) are not feasible
  - No ranked retrieval (best documents to return first)

Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

            Antony and   Julius   The
            Cleopatra    Caesar   Tempest  Hamlet  Othello  Macbeth
Antony          1           1        0        0       0        1
Brutus          1           1        0        1       0        0
Caesar          1           1        0        1       1        1
Calpurnia       0           1        0        0       0        0
Cleopatra       1           0        0        0       0        0
mercy           1           0        1        1       1        1
worser          1           0        1        1       1        0

Query: Brutus AND Caesar but NOT Calpurnia


Query evaluation and optimization

- 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100.

Can we build such a matrix for a realistic collection?

- Consider: 1M documents, each with about 1K terms.
- 6 GB of data in the documents (avg 6 bytes/term, incl. spaces/punctuation).
- Assume 500K distinct terms.
- The 500K x 1M matrix has half a trillion 0's and 1's — far too large to store explicitly, and it is extremely sparse.
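The vector computation above can be sketched directly. In this minimal Python example, each play's column of the incidence matrix is one bit of an integer (leftmost bit = Antony and Cleopatra); the play names and bit layout follow the matrix on the previous slide:

```python
# 0/1 rows of the incidence matrix, written as Python integer bit vectors.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

mask = 0b111111  # six plays in the collection

# Brutus AND Caesar AND NOT Calpurnia: bitwise AND with the complement.
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))  # -> 100100: Antony and Cleopatra, Hamlet
```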


Inverted index

For each term (token) T, we must store a list of all documents that contain T.

Dictionary   Postings lists (sorted by docID; each entry is a posting)
Brutus    -> 2 4 8 16 32 64 128
Caesar    -> 1 2 3 5 8 13 21 34
Calpurnia -> 13 16


Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Indexer step 1: Sequence of (Term, DocumentID) pairs, in document order:

(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1)
(capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2)
(the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Indexer step 2: Sorting by terms (ties broken by docID):

(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2)
(did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1)
(let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)


- Multiple term entries in a single document are merged.
- Frequency information is added.

Indexer step 3: Merging terms and adding frequencies, as (term, docID : freq):

ambitious 2:1, be 2:1, brutus 1:1, brutus 2:1, capitol 1:1, caesar 1:1, caesar 2:2,
did 1:1, enact 1:1, hath 2:1, I 1:2, i' 1:1, it 2:1, julius 1:1, killed 1:2, let 2:1,
me 1:1, noble 2:1, so 2:1, the 1:1, the 2:1, told 2:1, you 2:1, was 1:1, was 2:1, with 2:1

Indexer step 4: Splitting into dictionary and postings files, with pointers from dictionary entries into the postings.

Dictionary (term, #docs, total freq):

ambitious 1 1; be 1 1; brutus 2 2; capitol 1 1; caesar 2 3; did 1 1; enact 1 1;
hath 1 1; I 1 2; i' 1 1; it 1 1; julius 1 1; killed 1 2; let 1 1; me 1 1; noble 1 1;
so 1 1; the 2 2; told 1 1; you 1 1; was 2 2; with 1 1

Postings ((docID, freq) pairs, in dictionary order):

(2,1) (2,1) (1,1) (2,1) (1,1) (1,1) (2,2) (1,1) (1,1) (2,1) (1,2) (1,1) (2,1)
(1,1) (1,2) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (2,1) (1,1) (2,1) (2,1)
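The four indexer steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and lowercasing of everything except the pronoun "I", with punctuation already stripped:

```python
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Crude normalization: lowercase everything except the pronoun "I".
    return [t if t == "I" else t.lower() for t in text.split()]

# Steps 1-2: generate (term, docID) pairs, then sort by term (ties by docID).
pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in tokenize(text))

# Step 3: merge duplicate pairs and record per-document term frequency.
merged = Counter(pairs)  # (term, docID) -> freq

# Step 4: split into a dictionary (document frequency) and postings lists.
postings = defaultdict(list)
for (term, doc_id), freq in sorted(merged.items()):
    postings[term].append((doc_id, freq))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"])  # -> 2
print(postings["caesar"])    # -> [(1, 1), (2, 2)]
```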


Query processing: AND

Consider processing the query: Brutus AND Caesar

- Locate Brutus in the dictionary; retrieve its postings.
- Locate Caesar in the dictionary; retrieve its postings.
- "Merge" (intersect) the two postings lists:

Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34

The merge

- Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:

Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
Result -> 2 8

- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings sorted by docID.
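The merge above can be written out directly. A minimal Python sketch, with postings lists as plain sorted lists of docIDs:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # -> [2, 8]
```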


Boolean queries: Exact match

- The Boolean retrieval model lets the user ask a query that is a Boolean expression:
  - Boolean queries use AND, OR, and NOT to join query terms
  - Views each document as a set of words
  - Is precise: a document either matches the condition or it does not.

Boolean queries: More general merges

Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?
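For Brutus AND NOT Caesar, the same linear walk works: skip any docID that also appears in the second list. A sketch in Python (same sorted-list representation as before):

```python
def and_not(p1, p2):
    """Docs in p1 but not in p2; both sorted by docID. O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j < len(p2) and p1[i] == p2[j]:
            i += 1  # present in both lists: excluded by NOT
            j += 1
        elif j < len(p2) and p1[i] > p2[j]:
            j += 1  # advance past docIDs only in p2
        else:
            answer.append(p1[i])
            i += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))  # -> [4, 16, 32, 64, 128]
```

Brutus OR NOT Caesar is harder: its answer includes every document that lacks Caesar, so it cannot be computed from these two postings lists alone.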


Query optimization

- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.

Brutus    -> 2 4 8 16 32 64 128
Caesar    -> 1 2 3 5 8 13 21 34
Calpurnia -> 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example

- Process terms in order of increasing document frequency:
  - start with the smallest postings set, then keep cutting further.

Brutus    -> 2 4 8 16 32 64 128
Caesar    -> 1 2 3 5 8 13 21 34
Calpurnia -> 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.
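The frequency-ordered strategy can be sketched as follows, reusing a linear merge; document frequency is simply the length of each postings list:

```python
def intersect(p1, p2):
    # Linear merge of two sorted docID lists.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

postings = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16],
}

def and_query(terms):
    # Process in order of increasing document frequency (postings length),
    # so each intermediate result is as small as possible.
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, postings[term])
    return result

print(and_query(["brutus", "calpurnia", "caesar"]))  # -> []
```

Here Calpurnia AND Brutus first yields [16], and intersecting with Caesar leaves nothing: no play contains all three terms.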


More general optimization

- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get document frequencies for all terms.
- Estimate the size of each OR by the sum of its terms' frequencies (a conservative upper bound).
- Process the ANDs in increasing order of estimated OR sizes.

Exercise

Recommend a query processing order for:

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812
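Applying the heuristic from the previous slide, the OR sizes can be estimated quickly (a Python sketch; the sums of document frequencies are conservative upper bounds):

```python
freq = {
    "eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
    "skies": 271658, "tangerine": 46653, "trees": 316812,
}

clauses = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# Estimate each OR clause by the sum of its terms' document frequencies.
estimates = {c: freq[c[0]] + freq[c[1]] for c in clauses}
for clause in sorted(clauses, key=estimates.get):
    print(clause, estimates[clause])
```

This gives (kaleidoscope OR eyes) = 300321, (tangerine OR trees) = 363465, (marmalade OR skies) = 379571, so the clauses should be processed in that order.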


More Advanced IR

- What about phrases?
  - e.g., Stanford University
- Proximity: find Gates NEAR Microsoft.
  - Need the index to capture position information in docs.
- Zones in documents: find documents with (author = Ullman) AND (text contains automata).
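Phrase and proximity queries need positions in the postings. A minimal positional-index sketch in Python, with a hypothetical NEAR check over two made-up documents (the `near` helper and its docs are illustrative, not from the slides):

```python
from collections import defaultdict

docs = {
    1: "gates opened the microsoft offices",
    2: "bill gates microsoft chairman",
}

# Positional index: term -> {docID: [positions]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].setdefault(doc_id, []).append(pos)

def near(t1, t2, k):
    """DocIDs where t1 and t2 occur within k positions of each other."""
    hits = []
    for doc_id in index[t1].keys() & index[t2].keys():
        if any(abs(p1 - p2) <= k
               for p1 in index[t1][doc_id] for p2 in index[t2][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(near("gates", "microsoft", 2))  # -> [2]
```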

Ranking search results

- Boolean queries give only inclusion or exclusion of docs.
- Often we want to rank/group results:
  - Need to measure the proximity of each doc to the query.
  - Need to decide whether the docs presented to the user are singletons, or a group of docs covering various aspects of the query.


Clustering and classification

- Clustering: given a set of docs, group them into clusters based on their contents.
- Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

The web and its challenges

I Unusual and diverse documents I Unusual and diverse users, queries, information needs I Beyond terms, exploit ideas from social networks

G link analysis, clickstreams ...

I How do search engines work? And how can we make them

better?