Gibbs Sampling for LDA, Lei Tang, Department of CSE, Arizona State University



SLIDE 1

Gibbs Sampling for LDA

Lei Tang

Department of CSE Arizona State University

January 7, 2008

1 / 10

SLIDE 2

Graphical Representation

α and β are fixed hyper-parameters. We need to estimate the parameters θ for each document and φ for each topic. The topic assignments Z are latent variables. This setup differs from the original LDA work.

2 / 10

SLIDE 3

Property of Dirichlet

For µ ∼ Dirichlet(α), the expectation of each component is

$$
E(\mu_k) = \frac{\alpha_k}{\alpha_0}, \qquad \text{where } \alpha_0 = \sum_k \alpha_k .
$$
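A quick numerical check of this property, as a minimal sketch: the helper name `dirichlet_mean` and the α values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def dirichlet_mean(alpha):
    """Closed-form mean of Dirichlet(alpha): alpha_k / alpha_0."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

alpha = np.array([2.0, 3.0, 5.0])   # illustrative values, alpha_0 = 10
rng = np.random.default_rng(0)

# Monte Carlo estimate of E(mu) from samples of the Dirichlet
samples = rng.dirichlet(alpha, size=200_000)
empirical_mean = samples.mean(axis=0)

# Closed form: [2/10, 3/10, 5/10]
analytic_mean = dirichlet_mean(alpha)
```

The empirical mean should agree with the closed form up to sampling noise.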

3 / 10

SLIDE 4

Gibbs Variants

1 Gibbs Sampling

Draw a conditioned on b, c
Draw b conditioned on a, c
Draw c conditioned on a, b

2 Block Gibbs Sampling

Draw a, b conditioned on c
Draw c conditioned on a, b

3 Collapsed Gibbs Sampling

Draw a conditioned on c
Draw c conditioned on a

b is collapsed (integrated) out during the sampling process.
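The three variants can be compared on a toy problem. The sketch below assumes a hypothetical joint distribution over three binary variables stored as a table `p[a, b, c]`; the table and all function names are illustrative. Each sampler targets the same joint, so each should recover the same marginal of a.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical joint distribution over three binary variables, indexed p[a, b, c].
p = rng.random((2, 2, 2))
p /= p.sum()

def draw(weights):
    """Draw an index with probability proportional to `weights`."""
    weights = np.asarray(weights, dtype=float).ravel()
    return rng.choice(len(weights), p=weights / weights.sum())

def gibbs(n_iter):
    """Plain Gibbs: one variable at a time, conditioned on the other two."""
    a = b = c = 0
    out = []
    for _ in range(n_iter):
        a = draw(p[:, b, c])
        b = draw(p[a, :, c])
        c = draw(p[a, b, :])
        out.append((a, c))
    return np.array(out)

def block_gibbs(n_iter):
    """Block Gibbs: draw (a, b) jointly given c, then c given (a, b)."""
    a = b = c = 0
    out = []
    for _ in range(n_iter):
        a, b = divmod(draw(p[:, :, c]), 2)  # flat index a*2 + b back to (a, b)
        c = draw(p[a, b, :])
        out.append((a, c))
    return np.array(out)

def collapsed_gibbs(n_iter):
    """Collapsed Gibbs: b is summed out, so only a and c are sampled."""
    p_ac = p.sum(axis=1)                    # joint of (a, c) with b marginalised
    a = c = 0
    out = []
    for _ in range(n_iter):
        a = draw(p_ac[:, c])
        c = draw(p_ac[a, :])
        out.append((a, c))
    return np.array(out)

true_marginal_a1 = p[1].sum()               # exact P(a = 1)
estimates = {f.__name__: f(5_000)[:, 0].mean()
             for f in (gibbs, block_gibbs, collapsed_gibbs)}
```

In LDA the role of b is played by θ and φ: they are summed (integrated) out analytically, and only the topic assignments are sampled.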

4 / 10

SLIDE 5

Collapsed Sampling for LDA

In the original paper “Finding Scientific Topics”, the authors are more interested in text modelling (finding Z); hence, the Gibbs sampling procedure boils down to estimating $P(z_i = j \mid z_{-i}, w)$. Here, θ and φ are integrated out. In fact, if we know the exact Z for each document, it is trivial to estimate θ and φ.

$$
\begin{aligned}
P(z_i = j \mid z_{-i}, w) &\propto P(z_i = j, z_{-i}, w) \\
&\propto P(w_i \mid z_i = j, z_{-i}, w_{-i}) \, P(z_i = j \mid z_{-i}, w_{-i}) \\
&= P(w_i \mid z_i = j, z_{-i}, w_{-i}) \, P(z_i = j \mid z_{-i})
\end{aligned}
$$

The second step drops the factor $P(z_{-i}, w_{-i})$, which does not depend on j. The first term is the likelihood; the second term acts like a prior.

5 / 10

SLIDE 6

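$$
\begin{aligned}
P(w_i \mid z_i = j, z_{-i}, w_{-i})
&= \int P(w_i \mid z_i = j, \phi^{(j)}) \, P(\phi^{(j)} \mid z_{-i}, w_{-i}) \, d\phi^{(j)} \\
&= \int \phi^{(j)}_{w_i} \, P(\phi^{(j)} \mid z_{-i}, w_{-i}) \, d\phi^{(j)}
\end{aligned}
$$

$$
P(\phi^{(j)} \mid z_{-i}, w_{-i}) \propto P(w_{-i} \mid \phi^{(j)}, z_{-i}) \, P(\phi^{(j)}),
\qquad \text{so} \qquad
\phi^{(j)} \mid z_{-i}, w_{-i} \sim \mathrm{Dirichlet}\big(\beta + n^{(w)}_{-i,j}\big)
$$

Here, $n^{(w)}_{-i,j}$ is the number of instances of word w assigned to topic j, excluding the current one.

Using the property of the expectation of the Dirichlet distribution, we have

$$
P(w_i \mid z_i = j, z_{-i}, w_{-i}) = \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
$$

where $n^{(\cdot)}_{-i,j}$ is the total number of words assigned to topic j, excluding the current one, and W is the vocabulary size.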
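As a small worked example of the count-based form of the word likelihood, the sketch below evaluates it for every topic at once. The function name `word_likelihood` and the count table are hypothetical, chosen only for illustration.

```python
import numpy as np

def word_likelihood(n_wj, w, beta):
    """P(w_i = w | z_i = j, z_-i, w_-i) for every topic j.

    n_wj[w, j] counts instances of word w assigned to topic j,
    with the current token already excluded.
    """
    n_wj = np.asarray(n_wj, dtype=float)
    W = n_wj.shape[0]                       # vocabulary size
    return (n_wj[w] + beta) / (n_wj.sum(axis=0) + W * beta)

# Hypothetical counts: 3 words (rows) x 2 topics (columns).
counts = [[2, 0],
          [1, 3],
          [0, 1]]
probs = word_likelihood(counts, w=1, beta=0.5)
# topic 0: (1 + 0.5) / (3 + 1.5), topic 1: (3 + 0.5) / (4 + 1.5)
```

For a fixed topic, the values over all words sum to one, as expected of the posterior mean of a Dirichlet.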

6 / 10

SLIDE 8

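Similarly, for the second term, we have

$$
P(z_i = j \mid z_{-i}) = \int P(z_i = j \mid \theta^{(d)}) \, P(\theta^{(d)} \mid z_{-i}) \, d\theta^{(d)}
$$

$$
P(\theta^{(d)} \mid z_{-i}) \propto P(z_{-i} \mid \theta^{(d)}) \, P(\theta^{(d)}),
\qquad \text{so} \qquad
\theta^{(d)} \mid z_{-i} \sim \mathrm{Dirichlet}\big(n^{(d)}_{-i,j} + \alpha\big)
$$

where $n^{(d)}_{-i,j}$ is the number of words in document d assigned to topic j, excluding the current one.

$$
P(z_i = j \mid z_{-i}) = \frac{n^{(d)}_{-i,j} + \alpha}{n^{(d)}_{-i,\cdot} + K\alpha}
$$

where $n^{(d)}_{-i,\cdot}$ is the total number of topic assignments (one per word) in document d, excluding the current one, and K is the number of topics.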

7 / 10

SLIDE 9

Algorithm

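$$
P(z_i = j \mid z_{-i}, w) \propto
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d)}_{-i,j} + \alpha}{n^{(d)}_{-i,\cdot} + K\alpha}
$$

Need to record four count variables:

document-topic count $n^{(d)}_{-i,j}$
document-topic sum $n^{(d)}_{-i,\cdot}$ (actually a constant)
topic-term count $n^{(w_i)}_{-i,j}$
topic-term sum $n^{(\cdot)}_{-i,j}$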
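The sampling loop can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name, argument names, default hyper-parameter values and the toy corpus are all assumptions. Because the document-topic sum does not depend on j, it is dropped from the unnormalised probability and only three count arrays are kept.

```python
import numpy as np

def lda_collapsed_gibbs(docs, W, K, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).

    docs: list of lists of word ids in [0, W).
    Returns (z, n_dk, n_kw, n_k): topic assignments and the count arrays.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # document-topic count
    n_kw = np.zeros((K, W))           # topic-term count
    n_k = np.zeros(K)                 # topic-term sum
    z = []
    for d, doc in enumerate(docs):    # random initial topic assignment
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, j in zip(doc, z_d):
            n_dk[d, j] += 1
            n_kw[j, w] += 1
            n_k[j] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Exclude the current token from all counts (the "-i" counts).
                n_dk[d, j] -= 1
                n_kw[j, w] -= 1
                n_k[j] -= 1
                # Unnormalised P(z_i = j | z_-i, w); the document-topic
                # denominator is constant in j and therefore omitted.
                prob = (n_kw[:, w] + beta) / (n_k + W * beta) * (n_dk[d] + alpha)
                j = rng.choice(K, p=prob / prob.sum())
                # Record the new assignment and restore the counts.
                z[d][i] = j
                n_dk[d, j] += 1
                n_kw[j, w] += 1
                n_k[j] += 1
    return z, n_dk, n_kw, n_k

# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3, 3], [0, 0, 1]]
z, n_dk, n_kw, n_k = lda_collapsed_gibbs(docs, W=4, K=2)
```

Updating the counts in place keeps each token update at O(K) cost, which is the main reason the four count variables are recorded rather than recomputed.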

8 / 10

SLIDE 10

Parameter Estimation

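To obtain φ and θ, there are two options: draw one sample of z, or draw multiple samples of z and average the resulting estimates.

$$
\phi_{j,w} = \frac{n^{(j)}_w + \beta}{\sum_{w=1}^{V} n^{(j)}_w + V\beta}
$$

$$
\theta^{(d)}_j = \frac{n^{(d)}_j + \alpha}{\sum_{z=1}^{K} n^{(d)}_z + K\alpha}
$$

where $n^{(j)}_w$ is the frequency of word w assigned to topic j, $n^{(d)}_z$ is the number of words in document d assigned to topic z, and V is the vocabulary size.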
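Given the counts from a single sample of z, both estimates are one line each. The sketch below assumes the counts are stored as a topic-by-word array `n_kw` and a document-by-topic array `n_dk`; these names, the function name and the example numbers are illustrative.

```python
import numpy as np

def estimate_phi_theta(n_kw, n_dk, alpha, beta):
    """Point estimates of phi (K x V) and theta (D x K) from one sample of z."""
    n_kw = np.asarray(n_kw, dtype=float)    # n_kw[j, w]: word w assigned to topic j
    n_dk = np.asarray(n_dk, dtype=float)    # n_dk[d, j]: words in doc d with topic j
    K, V = n_kw.shape
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Hypothetical counts: 2 topics, 3 words, 2 documents.
n_kw = [[3, 1, 0],
        [0, 2, 4]]
n_dk = [[4, 0],
        [0, 6]]
phi, theta = estimate_phi_theta(n_kw, n_dk, alpha=0.5, beta=0.1)
# theta[0] = (4 + 0.5, 0 + 0.5) / (4 + 2 * 0.5) = (0.9, 0.1)
```

Each row of `phi` and of `theta` sums to one, so they can be read directly as topic-word and document-topic distributions.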

9 / 10

SLIDE 11

Comment

Compared with variational Bayes (VB), Gibbs sampling is:

Easy to implement.
Easy to extend.
More efficient.
Faster to obtain a good approximation.

10 / 10