SLIDE 1

Correlated Topic Models

Authors: Blei and Lafferty, 2006. Reviewer: Casey Hanson

SLIDE 2

Recap: Latent Dirichlet Allocation

  • $D$ ≑ set of documents.
  • $K$ ≑ set of topics.
  • $V$ ≑ set of all words; $N$ words in each document.
  • $\theta_d$ ≑ multinomial over topics for a document $d \in D$: $\theta_d \sim \mathrm{Dir}(\alpha)$
  • $\beta_k$ ≑ multinomial over words in a topic $k \in K$: $\beta_k \sim \mathrm{Dir}(\eta)$
  • $z_{d,n}$ ≑ topic selected for word $n$ in document $d$: $z_{d,n} \sim \mathrm{Multi}(\theta_d)$
  • $w_{d,n}$ ≑ $n$th word in document $d$: $w_{d,n} \sim \mathrm{Multi}(\beta_{z_{d,n}})$

SLIDE 3

Latent Dirichlet Allocation

  • Need to calculate the posterior: $P(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K} \mid w_{1:D,1:N}, \alpha, \eta)$
  • $\propto p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K}, w_{1:D,1:N} \mid \alpha, \eta)$
  • The normalization factor, $\int_{\beta} \int_{\theta} \sum_{z} p(\cdot)$, is intractable.
  • Need to use approximate inference, e.g. Gibbs sampling.
  • Drawback: no intuitive relationship between topics.
  • Challenge: develop a method similar to LDA that captures relationships between topics.
SLIDE 4

Normal or Gaussian Distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

  • Continuous distribution.
  • Symmetrical and defined for $-\infty < x < \infty$.
  • Parameters: $\mathcal{N}(\mu, \sigma^2)$
  • $\mu$ ≑ mean
  • $\sigma^2$ ≑ variance
  • $\sigma$ ≑ standard deviation
  • Estimation from data $X = x_1 \ldots x_n$:
  • $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$
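A quick numerical check of the two estimators (a sketch assuming numpy; the true parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # X = x_1 ... x_n

mu_hat = x.mean()                      # (1/n) sum_i x_i
var_hat = ((x - mu_hat) ** 2).mean()   # (1/n) sum_i (x_i - mu_hat)^2
print(mu_hat, var_hat)                 # close to 2.0 and 9.0
```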

SLIDE 5

Multivariate Gaussian Distribution: $k$ dimensions

$$f(\mathbf{X}) = f(X_1 \ldots X_k) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(\mathbf{X}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{X}-\boldsymbol{\mu})}$$

  • $\mathbf{X} = [X_1 \ldots X_k]^T \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
  • $\boldsymbol{\mu}$ ≑ $k \times 1$ vector of means for each dimension.
  • $\Sigma$ ≑ $k \times k$ covariance matrix.

Example: 2D Case

  • $\boldsymbol{\mu} = E[\mathbf{X}] = \begin{bmatrix} E[X_1] \\ E[X_2] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
  • $\Sigma = \begin{bmatrix} E[(X_1-\mu_1)^2] & E[(X_1-\mu_1)(X_2-\mu_2)] \\ E[(X_1-\mu_1)(X_2-\mu_2)] & E[(X_2-\mu_2)^2] \end{bmatrix}$
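A sketch of the 2D case (assuming numpy/scipy): evaluate the closed-form density at a point, then confirm that the empirical mean vector and covariance matrix of samples match the parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # symmetric, positive definite

print(multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, -0.5]))  # f(X) at a point

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mu, Sigma, size=50_000)
print(X.mean(axis=0))            # approximately mu
print(np.cov(X, rowvar=False))   # approximately Sigma
```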

SLIDE 6

2D Multivariate Gaussian

  • $\Sigma = \begin{bmatrix} \sigma_{X_1}^2 & \rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} \\ \rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} & \sigma_{X_2}^2 \end{bmatrix}$
  • Topic correlations appear on the off-diagonal:
  • $\rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} = E[(X_1-\mu_1)(X_2-\mu_2)] = \frac{1}{n}\sum_{j=1}^{n} (X_{j,1}-\mu_1)(X_{j,2}-\mu_2)$
  • The covariance matrix is symmetric.
SLIDE 7

Matlab Demo
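The demo itself is not reproduced in the deck; a minimal Python stand-in (assuming numpy and matplotlib) contrasting an uncorrelated and a correlated 2D Gaussian:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
covs = {"uncorrelated": [[1.0, 0.0], [0.0, 1.0]],
        "correlated":   [[1.0, 0.8], [0.8, 1.0]]}

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (name, cov) in zip(axes, covs.items()):
    pts = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
    ax.scatter(pts[:, 0], pts[:, 1], s=2)  # the correlated cloud tilts diagonally
    ax.set_title(name)
plt.show()
```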

SLIDE 8

…Back to Topic Models

  • How can we adapt LDA to have correlations between topics?
  • In LDA, we assume two things:
  • Assumption 1: topics in a document are independent: $\theta_d \sim \mathrm{Dir}(\alpha)$
  • Assumption 2: the distribution of words in a topic is stationary: $\beta_k \sim \mathrm{Dir}(\eta)$
  • To sample topic distributions for topics that are correlated, we need to correct assumption 1.

SLIDE 9

Exponential Family of Distributions

  • Family of distributions whose densities can be placed in the following form:

$$f(x \mid \theta) = g(x) \cdot e^{\eta(\theta)^T T(x) - A(\theta)}$$

  • Ex: binomial distribution, $\theta = p$:

$$f(x \mid \theta) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}$$

  • $\eta(\theta) = \log\frac{p}{1-p}$ (the natural parameterization), $g(x) = \binom{n}{x}$, $A(\theta) = -n\log(1-p)$, $T(x) = x$

$$f(x) = \binom{n}{x}\, e^{x \cdot \log\frac{p}{1-p} + n \cdot \log(1-p)}$$
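A numerical check (assuming scipy) that the exponential-family form reproduces the binomial pmf:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, p = 10, 0.3
eta = np.log(p / (1 - p))    # natural parameter eta(theta)
A = -n * np.log(1 - p)       # log-partition A(theta)

for x in range(n + 1):
    pmf = binom.pmf(x, n, p)                 # C(n,x) p^x (1-p)^(n-x)
    ef = comb(n, x) * np.exp(eta * x - A)    # g(x) e^{eta T(x) - A(theta)}
    assert np.isclose(pmf, ef)
```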

SLIDE 10

Categorical Distribution

  • Multinomial with $n = 1$:
  • $f(x_1) = \theta_1$; $f(z_1) = \theta^T \cdot z_1$
  • where $z_1 = [1\ 0\ 0 \ldots 0]^T$ (Iverson bracket / indicator vector), $\sum_i z_i = 1$
  • Parameters: $\theta = [p_1\ p_2\ p_3]$, where $\sum_i p_i = 1$
  • $\theta' = \left[\frac{p_1}{p_3}\ \frac{p_2}{p_3}\ 1\right]$
  • $\log \theta' = \left[\log\frac{p_1}{p_3}\ \log\frac{p_2}{p_3}\ 0\right]$

SLIDE 11

Exponential Family Multinomial with $n = 1$

  • Recall: $f(z_i \mid \theta) = \theta^T \cdot z_i$
  • We want: $f(x \mid \theta) = g(x) \cdot e^{\eta(\theta)^T T(x) - A(\theta)}$
  • $f(z_i \mid \eta) = e^{\eta^T z_i - \log\sum_{j=1}^{K} e^{\eta_j}} = \frac{e^{\eta^T z_i}}{\sum_{j=1}^{K} e^{\eta_j}}$
  • Note: only $K - 1$ independent dimensions in the multinomial.
  • $\eta' = \left[\log\frac{p_1}{p_K}\ \log\frac{p_2}{p_K} \ldots\ 0\right]$, with $\eta'_i = \log\frac{p_i}{p_K}$
  • $f(z_i \mid \eta') = \frac{e^{\eta'^T z_i}}{1 + \sum_{j=1}^{K-1} e^{\eta'_j}}$

SLIDE 12

Verify: Classroom participation

  • Given: πœƒ = [log

π‘ž1 π‘žπ‘™ log π‘ž2 π‘žπ‘™ … 0]

  • Show: 𝑔 π‘Žπ‘— πœ„ = πœ„π‘ˆ β‹… π‘Žπ‘— = π‘“πœƒπ‘ˆπ‘Žπ‘—βˆ’log 𝑗=1 π‘“πœƒπ‘—
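One way the verification goes (a short derivation, using the fact that $\eta_K = 0$ makes $e^{\eta^T z_i} = e^{\eta_i} = p_i / p_K$):

$$e^{\eta^T z_i - \log\sum_{j=1}^{K} e^{\eta_j}} = \frac{e^{\eta_i}}{\sum_{j=1}^{K} e^{\eta_j}} = \frac{p_i / p_K}{\sum_{j=1}^{K} p_j / p_K} = \frac{p_i}{\sum_{j=1}^{K} p_j} = p_i = \theta^T \cdot z_i$$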
SLIDE 13

Intuition and Demo

  • Can sample πœƒ from any number of places.
  • Choose normal (allows for correlation between topic dimensions)
  • Get a topic distribution for each document by sampling:

πœƒ ~ π’ͺ

π‘™βˆ’1 𝜈, 𝜏

  • What is the 𝜈
  • Expected deviation from last topic: log

π‘žπ‘— π‘žπ‘™

  • Negative means push density towards last topic (πœƒπ‘— < 0, π‘žπ‘™ > π‘žπ‘—)
  • What about the covariance
  • Shows variability in deviation from last topic between topics.

𝜈 = 0 0 π‘ˆ, 𝜏 = [1 0; 0 1]
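A sketch (numpy) of this sampling step for $K = 3$ topics, using the $\mu = [0\ 0]^T$, $\Sigma = I$ setting above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, Sigma = np.zeros(2), np.eye(2)        # K - 1 = 2 dimensions

eta = rng.multivariate_normal(mu, Sigma)  # eta ~ N_{K-1}(mu, Sigma)
eta = np.append(eta, 0.0)                 # pinned last-topic coordinate
theta = np.exp(eta) / np.exp(eta).sum()   # softmax: a point on the simplex
print(theta)
```

Putting nonzero off-diagonal entries in $\Sigma$ (as on the next two slides) makes the first two coordinates of $\eta$, and hence the first two topic proportions, co-vary.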

SLIDE 14

Favoring Topic 3

Two panels: $\mu = (-0.9, -0.9)$ with $\Sigma = [1\ 0;\ 0\ 1]$, and the same $\mu$ with $\Sigma = [1\ {-0.9};\ {-0.9}\ 1]$.

SLIDE 15

Favoring Topic 3:

$\mu = (-0.9, -0.9)$, $\Sigma = [1\ 0.4;\ 0.4\ 1]$

SLIDE 16

Exercises

SLIDE 17

Correlated Topic Model

  • Algorithm (sketched in code below):
  • $\forall d \in D$:
  • Draw $\eta_d \mid (\mu, \Sigma) \sim \mathcal{N}(\mu, \Sigma)$
  • $\forall n \in \{1 \ldots N\}$:
  • Draw topic assignment $z_{d,n} \mid \eta_d \sim \mathrm{Categorical}(f(\eta_d))$
  • Draw word $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Categorical}(\beta_{z_{d,n}})$
  • Parameter estimation:
  • Intractable; use variational inference (later).
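A compact sketch of this generative process (numpy; the dimensions and hyperparameters are illustrative, and $f$ is the softmax from the earlier slides):

```python
import numpy as np

rng = np.random.default_rng(5)
K, V, N, D = 3, 1000, 50, 10
mu = np.zeros(K - 1)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])                    # correlated topics
beta = rng.dirichlet(np.full(V, 0.01), size=K)    # topic-word distributions

def f(eta):
    eta = np.append(eta, 0.0)                     # pin the last coordinate
    return np.exp(eta) / np.exp(eta).sum()

corpus = []
for d in range(D):
    eta_d = rng.multivariate_normal(mu, Sigma)    # eta_d ~ N(mu, Sigma)
    theta_d = f(eta_d)
    doc = [rng.choice(V, p=beta[rng.choice(K, p=theta_d)])  # draw z, then w
           for _ in range(N)]
    corpus.append(doc)
```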
SLIDE 18

Evaluation I: CTM on Test Data

SLIDE 19

Evaluation II: 10-Fold Cross-Validation, LDA vs. CTM

  • ~1500 documents in the corpus.
  • ~5600 unique words (after pruning).
  • Methodology:
  • Partition the data into 10 sets (10-fold cross-validation).
  • Calculate the log likelihood of each held-out set, having trained on the other 9 sets, for both LDA and CTM.
  • Both panels of the figure plot L(CTM) - L(LDA).

CTM shows a much higher log likelihood as the number of topics increases.

SLIDE 20

Evaluation III: Predictive Perplexity

  • Perplexity measure ≑ the expected number of equally likely words.
  • Lower perplexity means higher word resolution.
  • Suppose you see a percentage of the words in a document; how likely are the rest of the words in the document according to your model?
  • CTM does better with lower numbers of observed words.
  • It is able to infer certain words given topic probabilities.
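Perplexity is derived from the held-out log likelihood; a minimal sketch (the log-likelihood numbers are placeholders, not values from the paper):

```python
import numpy as np

def perplexity(log_likelihoods, num_words):
    # exp( - total held-out log likelihood / total held-out word count )
    return np.exp(-np.sum(log_likelihoods) / num_words)

ll = np.array([-310.2, -295.7, -402.1])  # placeholder per-document values
print(perplexity(ll, num_words=150))
```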

SLIDE 21

Conclusions

  • CTM changes the distribution from which the per-document topic proportions are drawn, from a Dirichlet to a logistic normal.
  • It is otherwise very similar to LDA.
  • It is able to model correlations between topics.
  • For larger numbers of topics, CTM performs better than LDA.
  • With known topics, CTM is able to infer word associations better than LDA.