
CS 1655 / Spring 2013
Secure Data Management and Web Applications

04 – Information Retrieval

Alexandros Labrinidis
University of Pittsburgh


What is Information Retrieval?

  • Information organized into documents
    – Large number of documents
    – Data in documents is unstructured
  • Quest:
    – Locate documents that match user's needs
    – How:
      • Keywords
      • Sample documents
  • Like finding a needle in a haystack
    – Or worse: a hay-colored needle!


IR vs Databases

  • Database systems:
    – Structured data
    – Complex data models
    – Updates/transactions/concurrency control
  • Information retrieval:
    – Unstructured data
    – Collection of documents
    – Approximate searching/ranking of results


How to retrieve information?

  • One way:
    – Get keywords from user
    – Scan entire collection of documents
    – Return documents that match
    – Problems?
  • Will not scale to large document collections
    – E.g., the Web
  • Will not rank results
    – E.g., too many matches for “Labrinidis”
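
A minimal Python sketch of the naive scan above (names such as naive_search are illustrative, not from the slides). It makes the two problems concrete: every query rescans the whole collection, and matches come back unranked.

```python
def naive_search(documents, keywords):
    """Return the documents containing every keyword (case-insensitive)."""
    matches = []
    for doc in documents:                     # full scan on every query
        words = set(doc.lower().split())
        if all(kw.lower() in words for kw in keywords):
            matches.append(doc)               # no ranking, just membership
    return matches

docs = ["Databases store structured data",
        "Information retrieval ranks unstructured documents",
        "Web search is information retrieval at scale"]
print(naive_search(docs, ["information", "retrieval"]))
# -> both matching documents, in collection order, with no notion of relevance
```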


Relevance Ranking using terms

  • Given a term t, how relevant is a document d to the term?
  • Approach #1:
    – Use the number of occurrences of t in d: n(d, t)
  • Approach #2:
    – Normalize the number of occurrences of t in d by the total number of terms in d:
      r(d, t) = log(1 + n(d, t) / n(d))
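
A short sketch of Approach #2, assuming whitespace tokenization (the function name relevance is illustrative, not from the slides):

```python
import math

def relevance(doc, term):
    # r(d, t) = log(1 + n(d, t) / n(d)), where n(d, t) is the number of
    # occurrences of t in d and n(d) is the total number of terms in d
    terms = doc.lower().split()
    return math.log(1 + terms.count(term) / len(terms))

print(relevance("to be or not to be", "be"))  # log(1 + 2/6) ≈ 0.29
```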


Handling Multiple Query Terms

  • Simple way:
    – Compute independent relevance measures
    – Add them up
  • Better way:
    – Use inverse document frequency for each term
      • n(t) = number of documents that contain term t
    – Relevance of document d to set of terms Q:
      r(d, Q) = Σ_{t ∈ Q} r(d, t) / n(t)
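
A sketch combining both formulas, under the same whitespace-tokenization assumption; the names n_t and relevance_q are illustrative. Dividing by n(t) discounts terms that appear in many documents.

```python
import math

def relevance(doc, term):
    # per-term relevance r(d, t) from the previous slide
    terms = doc.lower().split()
    return math.log(1 + terms.count(term) / len(terms))

def n_t(collection, term):
    # n(t): number of documents in the collection that contain the term
    return sum(1 for d in collection if term in d.lower().split())

def relevance_q(collection, doc, query):
    # r(d, Q) = sum over t in Q of r(d, t) / n(t)
    return sum(relevance(doc, t) / n_t(collection, t)
               for t in query if n_t(collection, t) > 0)

docs = ["florida oranges are sweet",
        "oranges grow in florida",
        "apples grow in washington"]
print(relevance_q(docs, docs[0], ["florida", "oranges"]))
```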


Not all words created equal

  • Google query:
    – the oranges from florida
      • http://www.google.com
  • “the” and “from” are very common and omitted from the search
    – These are called stop words
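
A minimal sketch of stop-word removal; the stop list here is a hypothetical toy, and real engines use far larger lists:

```python
# hypothetical, tiny stop list for illustration only
STOP_WORDS = {"the", "from", "a", "an", "of", "in", "to"}

def strip_stop_words(query):
    # drop very common words before the query reaches the index
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(strip_stop_words("the oranges from florida"))  # ['oranges', 'florida']
```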


Other factors affecting relevance

  • Word proximity
    – If two query terms appear close together in a document, it should rank higher than when they are far apart
    – Example?
  • Using hyperlinks
    – E.g., site popularity, hubs, authorities (more on this next time)
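
One possible proximity measure is the smallest distance, in word positions, between any occurrences of the two terms; this sketch assumes that definition (min_proximity is an illustrative name, not from the slides):

```python
def min_proximity(doc, t1, t2):
    """Smallest distance (in word positions) between occurrences of t1 and t2."""
    terms = doc.lower().split()
    pos1 = [i for i, w in enumerate(terms) if w == t1]
    pos2 = [i for i, w in enumerate(terms) if w == t2]
    if not pos1 or not pos2:
        return None  # at least one term is absent
    return min(abs(i - j) for i in pos1 for j in pos2)

# closer terms -> smaller distance -> the document should rank higher
print(min_proximity("oranges grown in sunny florida", "oranges", "florida"))  # 4
print(min_proximity("florida oranges are sweet", "oranges", "florida"))       # 1
```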


Scaling to large collections

  • Effective index structure is crucial
  • Documents containing a specific term are located using an inverted index
    – Each keyword maps to a list of documents that contain it
  • How to support OR/AND semantics?
    – OR: compute union of sets
    – AND: compute intersection of sets
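
A minimal inverted-index sketch: each keyword maps to the set of document ids that contain it, so OR is set union and AND is set intersection (structure and names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    # each keyword maps to the set of document ids that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["databases store structured data",
        "retrieval of unstructured documents",
        "secure data retrieval"]
index = build_index(docs)

print(index["data"] | index["retrieval"])  # OR: union -> {0, 1, 2}
print(index["data"] & index["retrieval"])  # AND: intersection -> {2}
```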


How to measure effectiveness

  • Approximate, incomplete results are usual
    – Especially if using an index
  • How to measure the quality of these results?
  • False negative:
    – A relevant document was not returned
  • False positive:
    – An irrelevant document was returned


Effectiveness metrics

  • Precision
    – What percentage of retrieved documents are relevant to the query?
  • Recall
    – What percentage of the documents relevant to the query have been retrieved?
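
A sketch of both metrics over sets of document ids (names are illustrative); note how they correspond to the false positives and false negatives from the previous slide:

```python
def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    # fraction of relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4}   # documents the system returned
relevant  = {2, 4, 5}      # documents actually relevant to the query
print(precision(retrieved, relevant))  # 0.5    (docs 1, 3 are false positives)
print(recall(retrieved, relevant))     # ≈ 0.67 (doc 5 is a false negative)
```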


How to improve effectiveness

  • Better ranking
  • Better indexing
  • Classification of documents
    – Instead of a “global” pool, focus on a smaller set of related documents
  • Feedback from users

Focused Crawling

  • google.com
    – Search for “database management systems”
  • google.com
    – Search for “database management systems”
    – Add site:pitt.edu to the query
  • scholar.google.com
    – Search for “database management systems”