

SLIDE 1

Latent Semantic Indexing

Information Systems M

Prof. Paolo Ciaccia

http://www-db.deis.unibo.it/courses/SI-M/

  • Two major problems plague the Vector Space Model:
  • synonymy: many ways to refer to the same object, e.g., car and automobile; this leads to poor recall
  • polysemy: most words have more than one distinct meaning, e.g., model, python, chip; this leads to poor precision

[Figure: synonymy gives low similarity between related documents; polysemy gives high similarity between unrelated documents (doc1, doc2, doc3).]


SLIDE 2

  • Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA) when not applied to IR, was proposed at the end of the 1980s as a way to solve such problems, http://lsi.argreenhouse.com/lsi/LSI.html
  • The basic observation is that terms are an unreliable means to assess the relevance of a document with respect to a query
  • Because of synonymy and polysemy
  • Thus, one would like to represent documents in a more semantically accurate way, i.e., in terms of “concepts”
  • LSI achieves this by analyzing the whole term-document matrix W and projecting it into a lower-dimensional “latent” space spanned by relevant “concepts”
  • More precisely, LSI uses a linear algebra technique, called Singular Value Decomposition (SVD), before performing dimensionality reduction


  • Given a square m×m matrix S, a non-null vector v is an eigenvector of S if there exists a scalar λ, called the eigenvalue of v, such that:

Sv = λv

  • The linear transformation associated to S does not change the direction of eigenvectors, which are just stretched/shrunk by an amount given by the corresponding eigenvalue
  • There are at most m distinct eigenvalues, which are the solutions of the characteristic equation:

det(S - λI) = 0

  • For each eigenvalue, there are infinitely many corresponding eigenvectors
  • If v is an eigenvector, so is kv, for any k ≠ 0
  • Thus, we can consider normalized eigenvectors, ∥v∥ = 1


SLIDE 3

  • If S is a real and symmetric matrix, then:
  • All its eigenvalues are real
  • All the (normalized) eigenvectors of distinct eigenvalues are mutually orthogonal (thus, linearly independent)
  • If S has m linearly independent eigenvectors, then it can be written as:

S = UΛU^T

where:

  • Λ is a diagonal matrix, Λ = diag(λ1, λ2, …, λm), with λ1 ≥ λ2 ≥ … ≥ λm
  • The columns of U are the corresponding eigenvectors; U is a column-orthonormal matrix


Example:

S = ( 2  1 )
    ( 1  2 )

|S - λI| = (2 - λ)² - 1 = 0

λ1 = 3, v1 = (1/√2, 1/√2);   λ2 = 1, v2 = (1/√2, -1/√2)

S = UΛU^T:

( 2  1 )   ( 1/√2   1/√2 ) ( 3  0 ) ( 1/√2   1/√2 )
( 1  2 ) = ( 1/√2  -1/√2 ) ( 0  1 ) ( 1/√2  -1/√2 )

Element-wise, s_{i,j} = Σ_{c=1..m} λ_c · u_{i,c} · u_{j,c}; for instance:

s_{1,2} = λ1 · u_{1,1} · u_{2,1} + λ2 · u_{1,2} · u_{2,2} = 3 · (1/√2) · (1/√2) + 1 · (1/√2) · (-1/√2) = 3/2 - 1/2 = 1
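
As a numerical check of this example (not part of the original slides), a minimal NumPy sketch:

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    # eigh handles real symmetric matrices; eigenvalues come back in ascending order
    lam, U = np.linalg.eigh(S)
    print(lam)   # [1. 3.]
    print(U)     # normalized eigenvectors (columns), entries ±1/√2

    # Verify the spectral decomposition S = U Λ U^T
    print(np.allclose(U @ np.diag(lam) @ U.T, S))   # True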

SLIDE 4

  • Consider the M×N term-document weight matrix W
  • If W has rank r ≤ min{M,N}, then W can be factorized as:

W = TΛD^T, i.e., w_{i,j} = Σ_{c=1..r} λ_c · t_{i,c} · d_{j,c}

where:

  • T is an M×r column-orthonormal matrix (T^T T = I)
  • Λ is an r×r diagonal matrix
  • D is an N×r column-orthonormal matrix (D^T D = I)
  • Λ is also called the “concept matrix”
  • T is the “term-concept similarity matrix”
  • D is the “document-concept similarity matrix”

  • SVD represents both terms and documents using a set of latent concepts
  • The element-wise form w_{i,j} = Σ_{c=1..r} λ_c · t_{i,c} · d_{j,c} just says that the weight of t_i in doc_j is expressed as a “linear combination of term-concept and doc-concept weights”
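
A minimal sketch of this factorization in NumPy (the matrix is a toy example, not from the slides); numpy.linalg.svd returns the three factors, with Λ as a vector of singular values:

    import numpy as np

    # Toy M×N term-document weight matrix (M = 4 terms, N = 3 documents)
    W = np.array([[1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])

    # Economy-size SVD: T is M×r, lam holds the λc's, Dt is r×N
    T, lam, Dt = np.linalg.svd(W, full_matrices=False)

    print(np.allclose(T.T @ T, np.eye(T.shape[1])))    # True: T^T T = I
    print(np.allclose(Dt @ Dt.T, np.eye(Dt.shape[0]))) # True: D^T D = I
    print(np.allclose(T @ np.diag(lam) @ Dt, W))       # True: W = T Λ D^T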

SLIDE 5

  • Consider the 12×9 weight matrix below, whose rank is r = 9, in which two “groups” of documents are present

W           C1  C2  C3  C4  C5  G1  G2  G3  G4
Human        1   0   0   1   0   0   0   0   0
Interface    1   0   1   0   0   0   0   0   0
Computer     1   1   0   0   0   0   0   0   0
User         0   1   1   0   1   0   0   0   0
System       0   1   1   2   0   0   0   0   0
Response     0   1   0   0   1   0   0   0   0
Time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
Survey       0   1   0   0   0   0   0   0   1
Tree         0   0   0   0   0   1   1   1   0
Graph        0   0   0   0   0   0   1   1   1
Minors       0   0   0   0   0   0   0   1   1

"%$

  • It is Λ

Λ Λ Λ = diag(3.34,2.54,2.35,1.64,1.50,1.31,0.85,0.56,0.35)

LSI Sistemi Informativi M 10

[The slide lists the full factors: the 12×9 term-concept matrix T and the 9×9 document-concept matrix D. Their entries are not reproduced here; the first two columns of each (the only ones used in the k = 2 example below) appear later as T2 and D2.]
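
These singular values can be reproduced numerically; a sketch assuming the 12×9 matrix W shown above (NumPy, not part of the slides):

    import numpy as np

    # The 12×9 weight matrix W of the example (rows: Human ... Minors)
    W = np.array([
        [1, 0, 0, 1, 0, 0, 0, 0, 0],   # Human
        [1, 0, 1, 0, 0, 0, 0, 0, 0],   # Interface
        [1, 1, 0, 0, 0, 0, 0, 0, 0],   # Computer
        [0, 1, 1, 0, 1, 0, 0, 0, 0],   # User
        [0, 1, 1, 2, 0, 0, 0, 0, 0],   # System
        [0, 1, 0, 0, 1, 0, 0, 0, 0],   # Response
        [0, 1, 0, 0, 1, 0, 0, 0, 0],   # Time
        [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
        [0, 1, 0, 0, 0, 0, 0, 0, 1],   # Survey
        [0, 0, 0, 0, 0, 1, 1, 1, 0],   # Tree
        [0, 0, 0, 0, 0, 0, 1, 1, 1],   # Graph
        [0, 0, 0, 0, 0, 0, 0, 1, 1],   # Minors
    ], dtype=float)

    T, lam, Dt = np.linalg.svd(W, full_matrices=False)
    # ≈ [3.34 2.54 2.35 1.64 1.50 1.31 0.85 0.56 0.36], matching Λ up to rounding
    print(np.round(lam, 2))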
SLIDE 6

  • Since both T and D are column-orthonormal matrices, it is:

W W^T = (TΛD^T)(TΛD^T)^T = (TΛD^T)(DΛT^T) = TΛ²T^T
W^T W = (DΛT^T)(TΛD^T) = DΛ²D^T

  • W W^T is the (real and symmetric) M×M term-term similarity matrix, and the columns of T are the eigenvectors of such matrix
  • W^T W is the N×N document-document similarity matrix, and the columns of D are the eigenvectors of such matrix
  • Λ² is a diagonal matrix with the eigenvalues of W W^T (and of W^T W)

!"%$

  • Since it is

we can view this as a “projection” of documents (columns of W) in the r-dimensional “concept space” spanned by T columns (i.e., TT rows)

  • In this space, documents are represented by the columns of Λ

Λ Λ ΛDT (i.e., rows of DΛ Λ Λ Λ)

  • It follows that WT W = (DΛ

Λ Λ Λ)(Λ Λ Λ ΛDT) amounts to compute the similarity between documents as the inner product in this r-dimensional latent semantic space

  • Similarly, in this space terms are represented by by the columns of Λ

Λ Λ ΛTT

LSI Sistemi Informativi M 12

TT W = Λ Λ Λ ΛDT
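
A small numerical check of this “projection” view (NumPy sketch with a toy matrix, not from the slides):

    import numpy as np

    # Toy term-document matrix, just to illustrate the algebra
    W = np.array([[1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0]])

    T, lam, Dt = np.linalg.svd(W, full_matrices=False)

    # Documents in the concept space: T^T W = Λ D^T (columns = documents)
    doc_coords = np.diag(lam) @ Dt
    print(np.allclose(T.T @ W, doc_coords))                  # True

    # Document-document inner products coincide with those of the projections
    print(np.allclose(W.T @ W, doc_coords.T @ doc_coords))   # True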

SLIDE 7

  • Given a “target” dimensionality k (k ≪ r), one can obtain an optimal approximation Wk of the W matrix by retaining only the k largest λc’s
  • Among all the rank-k approximations, Wk is the one that minimizes the Frobenius norm of the error:

∥W - Wk∥_F = min over {X : rank(X) = k} of ∥W - X∥_F,  where ∥A∥_F = √( Σ_{i=1..M} Σ_{j=1..N} a_{i,j}² ) is the Frobenius norm

  • Thus, now we have:

Wk = TkΛkDk^T

  • Typical values of k are 100 - 300
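
A minimal sketch of the rank-k truncation and its Frobenius error (NumPy; W here is a random stand-in, not from the slides):

    import numpy as np

    def rank_k_approx(W, k):
        """Keep only the k largest singular values/vectors of W."""
        T, lam, Dt = np.linalg.svd(W, full_matrices=False)
        return T[:, :k] @ np.diag(lam[:k]) @ Dt[:k, :]

    rng = np.random.default_rng(42)
    W = rng.random((12, 9))

    Wk = rank_k_approx(W, k=2)
    print(np.linalg.matrix_rank(Wk))        # 2
    # Frobenius error of the optimal rank-2 approximation
    print(np.linalg.norm(W - Wk, ord="fro"))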

"#$

  • With k = 2 one obtains: Λ

Λ Λ Λ2 =diag(3.34,2.54)

LSI Sistemi Informativi M 14 T2 0.22

  • 0.11

0.20

  • 0.07

0.24 0.04 0.40 0.06 0.64

  • 0.17

0.27 0.11 0.27 0.11 0.30

  • 0.14

0.21 0.27 0.01 0.49 0.04 0.62 0.03 0.45 D2

T

0.20 0.61 0.46 0.54 0.28 0.00 0.01 0.02 0.08

  • 0.06

0.17

  • 0.13
  • 0.23

0.11 0.19 0.44 0.62 0.53 W2 C1 C2 C3 C4 C5 G1 G2 G3 G4 Human 0.16 0.40 0.38 0.47 0.18

  • 0.05
  • 0.12
  • 0.16
  • 0.09

Interface 0.14 0.37 0.33 0.40 0.16

  • 0.03
  • 0.07
  • 0.10
  • 0.04

Computer 0.15 0.51 0.36 0.41 0.24 0.02 0.06 0.09 0.12 User 0.26 0.84 0.61 0.70 0.39 0.03 0.08 0.12 0.19 System 0.45 1.23 1.05 1.27 0.56

  • 0.07
  • 0.15
  • 0.21
  • 0.05

Response 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22 Time 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22 EPS 0.22 0.55 0.51 0.63 0.24

  • 0.07
  • 0.14
  • 0.20
  • 0.11

Survey 0.10 0.53 0.23 0.21 0.27 0.14 0.31 0.44 0.42 Tree

  • 0.06

0.23

  • 0.14 -0.27

0.14 0.24 0.55 0.77 0.66 Graph

  • 0.06

0.34

  • 0.15 -0.30

0.20 0.31 0.69 0.98 0.85 Minors

  • 0.04

0.25

  • 0.10 -0.21

0.15 0.22 0.50 0.71 0.62

SLIDE 8

  • Given a query q, an M×1 column vector, in matrix notation the N inner products of q and the documents are computed as:

q^T W = (q^T T)(ΛD^T)

which now becomes:

q^T Wk = (q^T Tk)(ΛkDk^T)

  • Thus, the query vector is transformed into q^T Tk (note that this is not a sparse vector anymore)
  • Example: the query “Human computer interaction” corresponds to the vector q^T = (1,0,1,0,0,0,0,0,0,0,0,0); this is transformed into the vector q^T T2 = (0.46, -0.07)
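
A sketch of this query transformation, reusing the two T2 rows of the query terms from the example above (NumPy, not part of the slides):

    import numpy as np

    # Rows of T2 for "human" and "computer" (the only non-zero query terms)
    t2_human    = np.array([0.22, -0.11])
    t2_computer = np.array([0.24,  0.04])

    # q^T T2 with unit weights reduces to summing the matching rows of T2
    q2 = t2_human + t2_computer
    print(np.round(q2, 2))    # [ 0.46 -0.07], as in the example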

"%$

  • The matrix Λ

Λ Λ Λ2D2

T is (Λ

Λ Λ Λ2 =diag(3.34,2.54)):

  • The figure shows how queries, documents and terms are distributed over

the 2-dimensional concept space

LSI Sistemi Informativi M 16

λ λ λ λ2D2

T

0.668 2.037 1.536 1.804 0.935 0.000 0.033 0.067 0.267

  • 0.152 0.432 -0.330 -0.584 0.279

0.483 1.118 1.575 1.346

  • 1
  • 0.5

0.5 1 1.5 2 0.5 1 1.5 2 2.5 Terms Documents Query

SLIDE 9

  • After normalizing and computing the cosine similarity, one obtains:

Cosine sim.  C1    C2    C3    C4    C5    G1     G2     G3     G4
VSM          0.81  0.29  0.00  0.29  0.00  0.00   0.00   0.00   0.00
LSI          0.99  0.93  0.99  0.98  0.90  -0.15  -0.12  -0.10  0.04

  • In general, it can be argued that if the document collection consists of k “topics”, then LSI will produce a “good” rank-k approximation of W (good = low Frobenius error)
  • In this case, the distance between documents related to a same topic will decrease, whereas that between documents of different topics will increase
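
The LSI row of this table can be re-derived from the concept-space coordinates given earlier; a NumPy sketch (not part of the slides):

    import numpy as np

    # Documents in the 2-d concept space: the columns of Λ2D2^T above
    docs = np.array([
        [ 0.668, 2.037,  1.536,  1.804, 0.935, 0.000, 0.033, 0.067, 0.267],
        [-0.152, 0.432, -0.330, -0.584, 0.279, 0.483, 1.118, 1.575, 1.346],
    ])

    # The transformed query "Human computer interaction"
    q = np.array([0.46, -0.07])

    cos = (q @ docs) / (np.linalg.norm(q) * np.linalg.norm(docs, axis=0))
    # ≈ [0.99 0.93 0.99 0.98 0.90 -0.15 -0.12 -0.10 0.04]; small deviations
    # arise because the inputs above are already rounded to a few decimals
    print(np.round(cos, 2))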

  • There are many situations in which a feature-object matrix shows up:
  • In IR, features = terms and objects = documents
  • For instance, take: features = opinions, objects = users; then, LSI can bring together “similar users”
  • Other examples: spam filtering for e-mails, cross-language retrieval, modelling of human cognitive function, etc.
  • It seems that Google also uses “some form” of LSI (“semantic search”), obtained by prefixing terms with “~” (e.g., ~phone, ~humor)

SLIDE 10

PROS

  • Tends to improve the effectiveness of the retrieval process
  • Decreases the dimensionality of vectors
  • Good for machine learning, in which high dimensionality is a problem
  • Dimensions have “semantics”

CONS

  • Reduced vectors are not sparse anymore
  • Expensive to compute, difficult to scale
  • LSI does not solve the polysemy problem, since each term is represented as a single point in the concept space (which is the weighted average of its possible meanings)

  • http://web.eecs.utk.edu/research/lsi/
  • http://lsi.research.telcordia.com/
  • Online demo of LSI
  • http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf
  • Describes an evolution of LSI, called PLSI (Probabilistic LSI), aiming to provide a more accurate description of documents in terms of latent concepts