SLIDE 1

Correlated Topic Models

Authors: Blei and Lafferty, 2006. Reviewer: Casey Hanson

SLIDE 2

Recap: Latent Dirichlet Allocation

  • $D$ ≑ set of documents.
  • $K$ ≑ set of topics.
  • $V$ ≑ set of all words; $N$ words in each document.
  • $\theta_d$ ≑ multinomial over topics for a document $d \in D$: $\theta_d \sim \mathrm{Dir}(\alpha)$
  • $\beta_k$ ≑ multinomial over words in a topic $k \in K$: $\beta_k \sim \mathrm{Dir}(\eta)$
  • $z_{d,n}$ ≑ topic selected for word $n$ in document $d$: $z_{d,n} \sim \mathrm{Multi}(\theta_d)$
  • $w_{d,n}$ ≑ $n$th word in document $d$: $w_{d,n} \sim \mathrm{Multi}(\beta_{z_{d,n}})$

SLIDE 3

Latent Dirichlet Allocation

  • Need to calculate the posterior: $P(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K} \mid w_{1:D,1:N}, \alpha, \eta)$
  • $\propto p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K}, w_{1:D,1:N} \mid \alpha, \eta)$
  • The normalization factor, $\int_{\beta} \int_{\theta} \sum_{z} p(\cdot)$, is intractable.
  • Need to use approximate inference, e.g. Gibbs sampling.
  • Drawback: no intuitive relationship between topics.
  • Challenge: develop a method similar to LDA that captures relationships between topics.
SLIDE 4

Normal or Gaussian Distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

  • Continuous distribution.
  • Symmetrical and defined for $-\infty < x < \infty$.
  • Parameters: $\mathcal{N}(\mu, \sigma^2)$
  • $\mu$ ≑ mean
  • $\sigma^2$ ≑ variance
  • $\sigma$ ≑ standard deviation
  • Estimation from data $X = x_1 \ldots x_n$:
  • $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$
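A quick numerical check of the two estimators (a sketch assuming numpy; the true parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # X = x_1 ... x_n

mu_hat = x.mean()                      # (1/n) sum_i x_i
var_hat = ((x - mu_hat) ** 2).mean()   # (1/n) sum_i (x_i - mu_hat)^2
print(mu_hat, var_hat)                 # close to 2.0 and 9.0
```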

SLIDE 5

Multivariate Gaussian Distribution: $k$ dimensions

$$f(\mathbf{X}) = f(X_1 \ldots X_k) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(\mathbf{X}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{X}-\boldsymbol{\mu})}$$

  • $\mathbf{X} = [X_1 \ldots X_k]^T \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
  • $\boldsymbol{\mu}$ ≑ $k \times 1$ vector of means for each dimension.
  • $\Sigma$ ≑ $k \times k$ covariance matrix.

Example: 2D Case

  • $\boldsymbol{\mu} = E[\mathbf{X}] = \begin{bmatrix} E[X_1] \\ E[X_2] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
  • $\Sigma = \begin{bmatrix} E[(X_1-\mu_1)^2] & E[(X_1-\mu_1)(X_2-\mu_2)] \\ E[(X_1-\mu_1)(X_2-\mu_2)] & E[(X_2-\mu_2)^2] \end{bmatrix}$
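A sketch of the 2D case (assuming numpy/scipy): evaluate the closed-form density at a point, then confirm that the empirical mean vector and covariance matrix of samples match the parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # symmetric, positive definite

print(multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, -0.5]))  # f(X) at a point

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mu, Sigma, size=50_000)
print(X.mean(axis=0))            # approximately mu
print(np.cov(X, rowvar=False))   # approximately Sigma
```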

SLIDE 6

2D Multivariate Gaussian

  • $\Sigma = \begin{bmatrix} \sigma_{X_1}^2 & \rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} \\ \rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} & \sigma_{X_2}^2 \end{bmatrix}$
  • Topic correlations appear on the off-diagonal:
  • $\rho_{X_1,X_2}\sigma_{X_1}\sigma_{X_2} = E[(X_1-\mu_1)(X_2-\mu_2)] = \frac{1}{n}\sum_{j=1}^{n} (X_{j,1}-\mu_1)(X_{j,2}-\mu_2)$
  • The covariance matrix is symmetric.
SLIDE 7

Matlab Demo
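The demo itself is not reproduced in the deck; a minimal Python stand-in (assuming numpy and matplotlib) contrasting an uncorrelated and a correlated 2D Gaussian:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
covs = {"uncorrelated": [[1.0, 0.0], [0.0, 1.0]],
        "correlated":   [[1.0, 0.8], [0.8, 1.0]]}

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (name, cov) in zip(axes, covs.items()):
    pts = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
    ax.scatter(pts[:, 0], pts[:, 1], s=2)  # the correlated cloud tilts diagonally
    ax.set_title(name)
plt.show()
```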

SLIDE 8

…Back to Topic Models

  • How can we adapt LDA to have correlations between topics?
  • In LDA, we assume two things:
  • Assumption 1: topics in a document are independent: $\theta_d \sim \mathrm{Dir}(\alpha)$
  • Assumption 2: the distribution of words in a topic is stationary: $\beta_k \sim \mathrm{Dir}(\eta)$
  • To sample topic distributions for topics that are correlated, we need to correct assumption 1.

SLIDE 9

Exponential Family of Distributions

  • Family of distributions whose densities can be placed in the following form:

$$f(x \mid \theta) = g(x) \cdot e^{\eta(\theta)^T T(x) - A(\theta)}$$

  • Ex: binomial distribution, $\theta = p$:

$$f(x \mid \theta) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}$$

  • $\eta(\theta) = \log\frac{p}{1-p}$ (the natural parameterization), $g(x) = \binom{n}{x}$, $A(\theta) = -n\log(1-p)$, $T(x) = x$

$$f(x) = \binom{n}{x}\, e^{x \cdot \log\frac{p}{1-p} + n \cdot \log(1-p)}$$
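A numerical check (assuming scipy) that the exponential-family form reproduces the binomial pmf:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, p = 10, 0.3
eta = np.log(p / (1 - p))    # natural parameter eta(theta)
A = -n * np.log(1 - p)       # log-partition A(theta)

for x in range(n + 1):
    pmf = binom.pmf(x, n, p)                 # C(n,x) p^x (1-p)^(n-x)
    ef = comb(n, x) * np.exp(eta * x - A)    # g(x) e^{eta T(x) - A(theta)}
    assert np.isclose(pmf, ef)
```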

SLIDE 10

Categorical Distribution

  • Multinomial with $n = 1$:
  • $f(x_1) = \theta_1$; $f(z_1) = \theta^T \cdot z_1$
  • where $z_1 = [1\ 0\ 0 \ldots 0]^T$ (Iverson bracket / indicator vector), $\sum_i z_i = 1$
  • Parameters: $\theta = [p_1\ p_2\ p_3]$, where $\sum_i p_i = 1$
  • $\theta' = \left[\frac{p_1}{p_3}\ \frac{p_2}{p_3}\ 1\right]$
  • $\log \theta' = \left[\log\frac{p_1}{p_3}\ \log\frac{p_2}{p_3}\ 0\right]$

SLIDE 11

Exponential Family Multinomial with $n = 1$

  • Recall: $f(z_i \mid \theta) = \theta^T \cdot z_i$
  • We want: $f(x \mid \theta) = g(x) \cdot e^{\eta(\theta)^T T(x) - A(\theta)}$
  • $f(z_i \mid \eta) = e^{\eta^T z_i - \log\sum_{j=1}^{K} e^{\eta_j}} = \frac{e^{\eta^T z_i}}{\sum_{j=1}^{K} e^{\eta_j}}$
  • Note: only $K - 1$ independent dimensions in the multinomial.
  • $\eta' = \left[\log\frac{p_1}{p_K}\ \log\frac{p_2}{p_K} \ldots\ 0\right]$, with $\eta'_i = \log\frac{p_i}{p_K}$
  • $f(z_i \mid \eta') = \frac{e^{\eta'^T z_i}}{1 + \sum_{j=1}^{K-1} e^{\eta'_j}}$

SLIDE 12

Verify: Classroom participation

  • Given: πœƒ = [log

π‘ž1 π‘žπ‘™ log π‘ž2 π‘žπ‘™ … 0]

  • Show: 𝑔 π‘Žπ‘— πœ„ = πœ„π‘ˆ β‹… π‘Žπ‘— = π‘“πœƒπ‘ˆπ‘Žπ‘—βˆ’log 𝑗=1 π‘“πœƒπ‘—
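One way the verification goes (a short derivation, using the fact that $\eta_K = 0$ makes $e^{\eta^T z_i} = e^{\eta_i} = p_i / p_K$):

$$e^{\eta^T z_i - \log\sum_{j=1}^{K} e^{\eta_j}} = \frac{e^{\eta_i}}{\sum_{j=1}^{K} e^{\eta_j}} = \frac{p_i / p_K}{\sum_{j=1}^{K} p_j / p_K} = \frac{p_i}{\sum_{j=1}^{K} p_j} = p_i = \theta^T \cdot z_i$$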
SLIDE 13

Intuition and Demo

  • Can sample πœƒ from any number of places.
  • Choose normal (allows for correlation between topic dimensions)
  • Get a topic distribution for each document by sampling:

πœƒ ~ π’ͺ

π‘™βˆ’1 𝜈, 𝜏

  • What is the 𝜈
  • Expected deviation from last topic: log

π‘žπ‘— π‘žπ‘™

  • Negative means push density towards last topic (πœƒπ‘— < 0, π‘žπ‘™ > π‘žπ‘—)
  • What about the covariance
  • Shows variability in deviation from last topic between topics.

𝜈 = 0 0 π‘ˆ, 𝜏 = [1 0; 0 1]
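A sketch (numpy) of this sampling step for $K = 3$ topics, using the $\mu = [0\ 0]^T$, $\Sigma = I$ setting above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, Sigma = np.zeros(2), np.eye(2)        # K - 1 = 2 dimensions

eta = rng.multivariate_normal(mu, Sigma)  # eta ~ N_{K-1}(mu, Sigma)
eta = np.append(eta, 0.0)                 # pinned last-topic coordinate
theta = np.exp(eta) / np.exp(eta).sum()   # softmax: a point on the simplex
print(theta)
```

Putting nonzero off-diagonal entries in $\Sigma$ (as on the next two slides) makes the first two coordinates of $\eta$, and hence the first two topic proportions, co-vary.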

SLIDE 14

Favoring Topic 3

Two panels: $\mu = (-0.9, -0.9)$ with $\Sigma = [1\ 0;\ 0\ 1]$, and the same $\mu$ with $\Sigma = [1\ {-0.9};\ {-0.9}\ 1]$.

SLIDE 15

Favoring Topic 3:

$\mu = (-0.9, -0.9)$, $\Sigma = [1\ 0.4;\ 0.4\ 1]$

SLIDE 16

Exercises

SLIDE 17

Correlated Topic Model

  • Algorithm (sketched in code below):
  • $\forall d \in D$:
  • Draw $\eta_d \mid (\mu, \Sigma) \sim \mathcal{N}(\mu, \Sigma)$
  • $\forall n \in \{1 \ldots N\}$:
  • Draw topic assignment $z_{d,n} \mid \eta_d \sim \mathrm{Categorical}(f(\eta_d))$
  • Draw word $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Categorical}(\beta_{z_{d,n}})$
  • Parameter estimation:
  • Intractable; use variational inference (later).
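A compact sketch of this generative process (numpy; the dimensions and hyperparameters are illustrative, and $f$ is the softmax from the earlier slides):

```python
import numpy as np

rng = np.random.default_rng(5)
K, V, N, D = 3, 1000, 50, 10
mu = np.zeros(K - 1)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])                    # correlated topics
beta = rng.dirichlet(np.full(V, 0.01), size=K)    # topic-word distributions

def f(eta):
    eta = np.append(eta, 0.0)                     # pin the last coordinate
    return np.exp(eta) / np.exp(eta).sum()

corpus = []
for d in range(D):
    eta_d = rng.multivariate_normal(mu, Sigma)    # eta_d ~ N(mu, Sigma)
    theta_d = f(eta_d)
    doc = [rng.choice(V, p=beta[rng.choice(K, p=theta_d)])  # draw z, then w
           for _ in range(N)]
    corpus.append(doc)
```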
SLIDE 18

Evaluation I: CTM on Test Data

SLIDE 19

Evaluation II: 10-Fold Cross-Validation, LDA vs. CTM

  • ~1500 documents in the corpus.
  • ~5600 unique words (after pruning).
  • Methodology:
  • Partition the data into 10 sets (10-fold cross-validation).
  • Calculate the log likelihood of each held-out set, having trained on the other 9 sets, for both LDA and CTM.
  • Both panels of the figure plot L(CTM) - L(LDA).

CTM shows a much higher log likelihood as the number of topics increases.

SLIDE 20

Evaluation III: Predictive Perplexity

  • Perplexity measure ≑ the expected number of equally likely words.
  • Lower perplexity means higher word resolution.
  • Suppose you see a percentage of the words in a document; how likely are the rest of the words in the document according to your model?
  • CTM does better with lower numbers of observed words.
  • It is able to infer certain words given topic probabilities.
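Perplexity is derived from the held-out log likelihood; a minimal sketch (the log-likelihood numbers are placeholders, not values from the paper):

```python
import numpy as np

def perplexity(log_likelihoods, num_words):
    # exp( - total held-out log likelihood / total held-out word count )
    return np.exp(-np.sum(log_likelihoods) / num_words)

ll = np.array([-310.2, -295.7, -402.1])  # placeholder per-document values
print(perplexity(ll, num_words=150))
```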

SLIDE 21

Conclusions

  • CTM changes the distribution from which the per-document topic proportions are drawn, from a Dirichlet to a logistic normal.
  • It is otherwise very similar to LDA.
  • It is able to model correlations between topics.
  • For larger numbers of topics, CTM performs better than LDA.
  • With known topics, CTM is able to infer word associations better than LDA.