
CS 1655 / Spring 2013
Secure Data Management and Web Applications

04 – Information Retrieval

Alexandros Labrinidis
University of Pittsburgh


What is Information Retrieval?

  • Information organized into documents
    – Large number of documents
    – Data in documents is unstructured
  • Quest:
    – Locate documents that match user's needs
    – How:
      • Keywords
      • Sample documents
  • Like finding a needle in a haystack
    – Or worse: a hay-colored needle!


IR vs Databases

  • Database systems:
    – Structured data
    – Complex data models
    – Updates/transactions/concurrency control
  • Information retrieval:
    – Unstructured data
    – Collection of documents
    – Approximate searching/ranking of results


How to retrieve information?

  • One way:
    – Get keywords from user
    – Scan entire collection of documents
    – Return documents that match
    – Problems?
  • Will not scale to large document collections
    – E.g., the Web
  • Will not rank results
    – E.g., too many matches for “Labrinidis”
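
A minimal Python sketch of the naive scan above (names such as naive_search are illustrative, not from the slides). It makes the two problems concrete: every query rescans the whole collection, and matches come back unranked.

```python
def naive_search(documents, keywords):
    """Return the documents containing every keyword (case-insensitive)."""
    matches = []
    for doc in documents:                     # full scan on every query
        words = set(doc.lower().split())
        if all(kw.lower() in words for kw in keywords):
            matches.append(doc)               # no ranking, just membership
    return matches

docs = ["Databases store structured data",
        "Information retrieval ranks unstructured documents",
        "Web search is information retrieval at scale"]
print(naive_search(docs, ["information", "retrieval"]))
# -> both matching documents, in collection order, with no notion of relevance
```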


Relevance Ranking using terms

  • Given a term t, how relevant is a document d to the term?
  • Approach #1:
    – Use the number of occurrences of t in d: n(d, t)
  • Approach #2:
    – Normalize the number of occurrences of t in d by the total number of terms in d:
      r(d, t) = log(1 + n(d, t) / n(d))
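
A short sketch of Approach #2, assuming whitespace tokenization (the function name relevance is illustrative, not from the slides):

```python
import math

def relevance(doc, term):
    # r(d, t) = log(1 + n(d, t) / n(d)), where n(d, t) is the number of
    # occurrences of t in d and n(d) is the total number of terms in d
    terms = doc.lower().split()
    return math.log(1 + terms.count(term) / len(terms))

print(relevance("to be or not to be", "be"))  # log(1 + 2/6) ≈ 0.29
```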


Handling Multiple Query Terms

  • Simple way:
    – Compute independent relevance measures
    – Add them up
  • Better way:
    – Use inverse document frequency for each term
      • n(t) = number of documents that contain term t
    – Relevance of document d to set of terms Q:
      r(d, Q) = Σ_{t ∈ Q} r(d, t) / n(t)
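
A sketch combining both formulas, under the same whitespace-tokenization assumption; the names n_t and relevance_q are illustrative. Dividing by n(t) discounts terms that appear in many documents.

```python
import math

def relevance(doc, term):
    # per-term relevance r(d, t) from the previous slide
    terms = doc.lower().split()
    return math.log(1 + terms.count(term) / len(terms))

def n_t(collection, term):
    # n(t): number of documents in the collection that contain the term
    return sum(1 for d in collection if term in d.lower().split())

def relevance_q(collection, doc, query):
    # r(d, Q) = sum over t in Q of r(d, t) / n(t)
    return sum(relevance(doc, t) / n_t(collection, t)
               for t in query if n_t(collection, t) > 0)

docs = ["florida oranges are sweet",
        "oranges grow in florida",
        "apples grow in washington"]
print(relevance_q(docs, docs[0], ["florida", "oranges"]))
```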


Not all words created equal

  • Google query:
    – the oranges from florida
      • http://www.google.com
  • “the” and “from” are very common and omitted from the search
    – These are called stop words
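
A minimal sketch of stop-word removal; the stop list here is a hypothetical toy, and real engines use far larger lists:

```python
# hypothetical, tiny stop list for illustration only
STOP_WORDS = {"the", "from", "a", "an", "of", "in", "to"}

def strip_stop_words(query):
    # drop very common words before the query reaches the index
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(strip_stop_words("the oranges from florida"))  # ['oranges', 'florida']
```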


Other factors affecting relevance

  • Word proximity
    – If two query terms appear close together in a document, it should rank higher than when they are far apart
    – Example?
  • Using hyperlinks
    – E.g., site popularity, hubs, authorities (more on this next time)
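
One possible proximity measure is the smallest distance, in word positions, between any occurrences of the two terms; this sketch assumes that definition (min_proximity is an illustrative name, not from the slides):

```python
def min_proximity(doc, t1, t2):
    """Smallest distance (in word positions) between occurrences of t1 and t2."""
    terms = doc.lower().split()
    pos1 = [i for i, w in enumerate(terms) if w == t1]
    pos2 = [i for i, w in enumerate(terms) if w == t2]
    if not pos1 or not pos2:
        return None  # at least one term is absent
    return min(abs(i - j) for i in pos1 for j in pos2)

# closer terms -> smaller distance -> the document should rank higher
print(min_proximity("oranges grown in sunny florida", "oranges", "florida"))  # 4
print(min_proximity("florida oranges are sweet", "oranges", "florida"))       # 1
```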


Scaling to large collections

  • Effective index structure is crucial
  • Documents containing a specific term are located using an inverted index
    – Each keyword maps to a list of documents that contain it
  • How to support OR/AND semantics?
    – OR: compute union of sets
    – AND: compute intersection of sets
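
A minimal inverted-index sketch: each keyword maps to the set of document ids that contain it, so OR is set union and AND is set intersection (structure and names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    # each keyword maps to the set of document ids that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["databases store structured data",
        "retrieval of unstructured documents",
        "secure data retrieval"]
index = build_index(docs)

print(index["data"] | index["retrieval"])  # OR: union -> {0, 1, 2}
print(index["data"] & index["retrieval"])  # AND: intersection -> {2}
```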


How to measure effectiveness

  • Approximate, incomplete results are usual
    – Especially if using an index
  • How to measure the quality of these results?
  • False negative:
    – A relevant document was not returned
  • False positive:
    – An irrelevant document was returned


Effectiveness metrics

  • Precision
    – What percentage of retrieved documents are relevant to the query?
  • Recall
    – What percentage of the documents relevant to the query have been retrieved?
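
A sketch of both metrics over sets of document ids (names are illustrative); note how they correspond to the false positives and false negatives from the previous slide:

```python
def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    # fraction of relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4}   # documents the system returned
relevant  = {2, 4, 5}      # documents actually relevant to the query
print(precision(retrieved, relevant))  # 0.5    (docs 1, 3 are false positives)
print(recall(retrieved, relevant))     # ≈ 0.67 (doc 5 is a false negative)
```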


How to improve effectiveness

  • Better ranking
  • Better indexing
  • Classification of documents
    – Instead of a “global” pool, focus on a smaller set of related documents
  • Feedback from users

Focused Crawling

  • google.com
    – Search for “database management systems”
  • google.com
    – Search for “database management systems”
    – Add site:pitt.edu to the query
  • scholar.google.com
    – Search for “database management systems”