Bayesian Inference for Parameter Estimation + Topic Modeling


SLIDE 1

Bayesian Inference for Parameter Estimation + Topic Modeling

10-418 / 10-618 Machine Learning for Structured Data

Matt Gormley, Lecture 20, Nov. 4, 2019

Machine Learning Department, School of Computer Science, Carnegie Mellon University

SLIDE 2

Reminders

  • Homework 3: Structured SVM

– Out: Fri, Oct. 24
– Due: Wed, Nov. 6 at 11:59pm

  • Homework 4: Topic Modeling

– Out: Wed, Nov. 6
– Due: Mon, Nov. 18 at 11:59pm

SLIDE 3

TOPIC MODELING

SLIDE 4

Topic Modeling

Motivation: Suppose you’re given a massive corpus and asked to carry out the following tasks:

  • Organize the documents into thematic categories
  • Describe the evolution of those categories over time
  • Enable a domain expert to analyze and understand the content
  • Find relationships between the categories
  • Understand how authorship influences the content
SLIDE 5

Topic Modeling

Motivation: Suppose you’re given a massive corpus and asked to carry out the following tasks:

  • Organize the documents into thematic categories
  • Describe the evolution of those categories over time
  • Enable a domain expert to analyze and understand the content
  • Find relationships between the categories
  • Understand how authorship influences the content

Topic Modeling: A method of (usually unsupervised) discovery of latent or hidden structure in a corpus

  • Applied primarily to text corpora, but techniques are more general
  • Provides a modeling toolbox
  • Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets

SLIDE 6

Topic Modeling

http://www.cs.umass.edu/~mimno/icml100.html

Dirichlet-multinomial regression (DMR) topic model on ICML (Mimno & McCallum, 2008)

SLIDE 7

Topic Modeling

  • Map of NIH Grants

https://app.nihmaps.org/

(Talley et al., 2011)

SLIDE 8

Other Applications of Topic Models

  • Spatial LDA (SLDA)

(Wang & Grimson, 2007)

[Figure comparing labelings: Manual, LDA, SLDA]

SLIDE 9

Outline

  • Applications of Topic Modeling
  • Latent Dirichlet Allocation (LDA)
    1. Beta-Bernoulli
    2. Dirichlet-Multinomial
    3. Dirichlet-Multinomial Mixture Model
    4. LDA
  • Bayesian Inference for Parameter Estimation
    – Exact inference
    – EM
    – Monte Carlo EM
    – Gibbs sampler
    – Collapsed Gibbs sampler
  • Extensions of LDA
    – Correlated topic models
    – Dynamic topic models
    – Polylingual topic models
    – Supervised LDA

SLIDE 10

BAYESIAN INFERENCE FOR NAÏVE BAYES

SLIDE 11

Beta-Bernoulli Model

  • Beta Distribution

f(φ | α, β) = (1 / B(α, β)) φ^(α−1) (1 − φ)^(β−1)

[Figure: Beta densities f(φ | α, β) for (α, β) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)]

SLIDE 12

Beta-Bernoulli Model

  • Generative Process:

φ ∼ Beta(α, β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Bernoulli(φ) [draw word]

  • Example corpus (heads/tails):

H T T H H T T H H H
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
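Below is a minimal simulation of this generative process, offered as a sketch rather than lecture code; numpy, the seed, and the Beta(2, 2) hyperparameters are illustrative assumptions.

```python
# Sketch: simulate the Beta-Bernoulli generative process.
# Hyperparameters and seed are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, N = 2.0, 2.0, 10
phi = rng.beta(alpha, beta)           # phi ~ Beta(alpha, beta)
x = rng.binomial(1, phi, size=N)      # x_n ~ Bernoulli(phi), 1 = heads
print("".join("H" if v else "T" for v in x))  # a corpus like the one above
```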

SLIDE 13

Dirichlet-Multinomial Model


SLIDE 14

Dirichlet-Multinomial Model

  • Dirichlet Distribution:

p(φ | α) = (1 / B(α)) ∏_{k=1}^K φ_k^{α_k − 1}, where B(α) = ∏_{k=1}^K Γ(α_k) / Γ(∑_{k=1}^K α_k)

[Figure: two surface plots of the Dirichlet density p(φ | α) over the probability simplex]
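As a quick empirical companion to these density plots, the sketch below (assuming numpy; K and the α values are illustrative) draws from symmetric Dirichlets: α < 1 yields sparse draws near the corners of the simplex, while large α yields draws near the uniform distribution.

```python
# Sketch: how the concentration parameter shapes Dirichlet draws.
import numpy as np

rng = np.random.default_rng(0)
K = 5
for a in (0.1, 1.0, 10.0):
    sample = rng.dirichlet(np.full(K, a))   # one draw from Dir(a, ..., a)
    print(a, np.round(sample, 3))           # small a: spiky; large a: near-uniform
```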

SLIDE 15

Dirichlet-Multinomial Model

  • Generative Process:

φ ∼ Dir(β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Mult(1, φ) [draw word]

  • Example corpus:

the he is the and the she she is is
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

SLIDE 16

Dirichlet-Multinomial Mixture Model

  • Generative Process
  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)

[Diagram: topics, documents (mixture)]

SLIDE 17

Dirichlet-Multinomial Mixture Model

  • Generative Process:

For each topic k ∈ {1, . . . , K}:
    φk ∼ Dir(β) [draw distribution over words]
θ ∼ Dir(α) [draw distribution over topics]
For each document m ∈ {1, . . . , M}:
    zm ∼ Mult(1, θ) [draw topic assignment]
    For each word n ∈ {1, . . . , Nm}:
        xmn ∼ Mult(1, φzm) [draw word]

  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)
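The sketch below simulates this mixture story with toy sizes (a hypothetical setup, not lecture code); note that a single topic z_m is drawn once per document and shared by all of its words.

```python
# Sketch: Dirichlet-Multinomial mixture generative process (toy sizes).
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_m = 10, 3, 4, 5                       # vocab, topics, docs, words/doc
alpha, beta = 1.0, 0.5
phi = rng.dirichlet(np.full(V, beta), size=K)    # phi_k ~ Dir(beta), one per topic
theta = rng.dirichlet(np.full(K, alpha))         # theta ~ Dir(alpha), corpus-wide
docs = []
for m in range(M):
    z_m = rng.choice(K, p=theta)                       # z_m ~ Mult(1, theta)
    docs.append(rng.choice(V, p=phi[z_m], size=N_m))   # x_mn ~ Mult(1, phi_{z_m})
```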

SLIDE 18

Bayesian Inference for Naïve Bayes

Whiteboard:

– Naïve Bayes is not Bayesian
– What if we observed both words and topics?
– Dirichlet-Multinomial in the fully observed setting is just Naïve Bayes
– Three ways of estimating parameters:
    1. MLE for Naïve Bayes
    2. MAP estimation for Naïve Bayes
    3. Bayesian parameter estimation for Naïve Bayes

SLIDE 19

Dirichlet-Multinomial Model

  • The Dirichlet is conjugate to the Multinomial.
  • The posterior of φ is p(φ|X) = p(X|φ) p(φ) / p(X).
  • Define the count vector n such that nt denotes the number of times word t appeared.
  • Then the posterior is also a Dirichlet distribution: p(φ|X) = Dir(β + n).

Recall the model:
φ ∼ Dir(β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Mult(1, φ) [draw word]
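In code, the conjugate update is just the count vector added to the prior; the sketch below uses the example corpus from the earlier slide with an illustrative symmetric Dir(1) prior.

```python
# Sketch: conjugate posterior update Dir(beta + n) for the
# Dirichlet-Multinomial model. The prior choice is illustrative.
import numpy as np

vocab = ["the", "he", "is", "and", "she"]
corpus = "the he is the and the she she is is".split()
beta = np.ones(len(vocab))                        # symmetric Dir(1) prior
n = np.array([corpus.count(w) for w in vocab])    # n_t = count of word t
posterior = beta + n                              # phi | X ~ Dir(beta + n)
print(posterior, posterior / posterior.sum())     # parameters and posterior mean
```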

SLIDE 20

LATENT DIRICHLET ALLOCATION (LDA)

SLIDE 21

Mixture vs. Admixture (LDA)

!"#$%& '"%()*+!&

,)$-!(.*/

!"#$%& '"%()*+!&

,0')$-!(.*/

Diagrams from Wallach, JHU 2011, slides

SLIDE 22

Latent Dirichlet Allocation

  • Generative Process
  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)

[Diagram: topics, documents (admixture)]

SLIDE 23

Latent Dirichlet Allocation

  • Generative Process:

For each topic k ∈ {1, . . . , K}:
    φk ∼ Dir(β) [draw distribution over words]
For each document m ∈ {1, . . . , M}:
    θm ∼ Dir(α) [draw distribution over topics]
    For each word n ∈ {1, . . . , Nm}:
        zmn ∼ Mult(1, θm) [draw topic assignment]
        xmn ∼ Mult(1, φzmn) [draw word]

  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)
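Here is the full LDA story as a simulation with toy sizes (a sketch under assumed hyperparameters, not the lecture's code). Compare with the mixture model above: the topic assignment now happens per word, not per document.

```python
# Sketch: LDA generative process (toy sizes; all values illustrative).
import numpy as np

rng = np.random.default_rng(0)
V, K, M = 10, 3, 5
alpha, beta = 0.5, 0.1
phi = rng.dirichlet(np.full(V, beta), size=K)      # phi_k ~ Dir(beta)
docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))     # theta_m ~ Dir(alpha), per doc
    N_m = rng.poisson(20)                          # doc length (illustrative)
    z = rng.choice(K, size=N_m, p=theta_m)         # z_mn ~ Mult(1, theta_m)
    x = [int(rng.choice(V, p=phi[k])) for k in z]  # x_mn ~ Mult(1, phi_{z_mn})
    docs.append(x)
```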

SLIDE 24

LDA for Topic Modeling

  • The generative story begins with only a Dirichlet prior over the topics.
  • Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk.

[Figure: topic-word distributions drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 25

LDA for Topic Modeling

  • The generative story begins with only a Dirichlet prior over the topics.
  • Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk.

[Figure: six topic-word distributions ϕ1 … ϕ6 drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 26

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β); one topic is shown as its top words: team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup]

(Blei, Ng, & Jordan, 2003)

SLIDE 27

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β); the topic with top words team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup is labeled {hockey}]

(Blei, Ng, & Jordan, 2003)

SLIDE 28

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β), labeled {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 29

LDA for Topic Modeling

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β): {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 30

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 31

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 32

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 33

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 34

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

θ1, θ2, θ3 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β): {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 35

LDA for Topic Modeling

[Same three example documents as on the previous slide]

θ1, θ2, θ3 ∼ Dirichlet(α): distributions over topics (one per document)
ϕ1 … ϕ6 ∼ Dirichlet(β): distributions over words (one per topic), labeled {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}

(Blei, Ng, & Jordan, 2003)

SLIDE 36

LDA for Topic Modeling

[Same three example documents, topic distributions θ1, θ2, θ3 ∼ Dirichlet(α), and labeled topics ϕ1 … ϕ6 ∼ Dirichlet(β) as above]

(Blei, Ng, & Jordan, 2003)

SLIDE 37

LDA for Topic Modeling

Inference and learning start with only the data:

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

[Figure: the θm, the ϕk, and both Dirichlet priors are now unobserved, shown as blanks]

(Blei, Ng, & Jordan, 2003)

SLIDE 38

Plate Diagrams

Whiteboard:

– Example #1: Plate diagram for Dirichlet-Multinomial model
– Example #2: Plate diagram for LDA

SLIDE 39

Latent Dirichlet Allocation

  • Plate Diagram

[Plate diagram: α → θm → zmn → xmn ← φk ← β, with plates over M documents, Nm words per document, and K topics]

SLIDE 40

Latent Dirichlet Allocation

  • Plate Diagram

[Plate diagram: α (Dirichlet prior) → θm (document-specific topic distribution) → zmn (topic assignment) → xmn (observed word) ← φk (topic) ← β (Dirichlet prior)]

SLIDE 41

Latent Dirichlet Allocation

Questions:

  • Is this a believable story for the generation of a corpus of documents?
  • Why might it work well anyway?
SLIDE 42

Latent Dirichlet Allocation

Why does LDA “work”?

  • LDA trades off two goals:
    1. For each document, allocate its words to as few topics as possible.
    2. For each topic, assign high probability to as few terms as possible.
  • These goals are at odds.
  • Putting a document in a single topic makes goal #2 hard: all of its words must have probability under that topic.
  • Putting very few words in each topic makes goal #1 hard: to cover a document’s words, it must assign many topics to it.
  • Trading off these goals finds groups of tightly co-occurring words.

Slide from David Blei, MLSS 2012

SLIDE 43

Latent Dirichlet Allocation

How does this relate to my other favorite models for capturing low-dimensional representations of a corpus?

  • LDA builds on latent semantic analysis (Deerwester et al., 1990; Hofmann, 1999).
  • It is a mixed-membership model (Erosheva, 2004).
  • It relates to PCA and matrix factorization (Jakulin and Buntine, 2002).
  • It was independently invented for genetics (Pritchard et al., 2000).

Slide from David Blei, MLSS 2012

SLIDE 44

Outline

  • Applications of Topic Modeling
  • Latent Dirichlet Allocation (LDA)
    1. Beta-Bernoulli
    2. Dirichlet-Multinomial
    3. Dirichlet-Multinomial Mixture Model
    4. LDA
  • Bayesian Inference for Parameter Estimation
    – Exact inference
    – EM
    – Monte Carlo EM
    – Gibbs sampler
    – Collapsed Gibbs sampler
  • Extensions of LDA
    – Correlated topic models
    – Dynamic topic models
    – Polylingual topic models
    – Supervised LDA

SLIDE 45

BAYESIAN INFERENCE FOR PARAMETER ESTIMATION

SLIDE 46
LDA Inference

  • Fully Observed MLE

[Plate diagram: θm (document-specific topic distribution, optimized) → zmn (topic assignment, observed) → xmn (observed word) ← φk (topic, optimized)]

Learning like this would be easy, but in practice we do not observe the topic assignments zmn.

SLIDE 47

LDA Inference

  • Fully Observed MAP Estimation

[Plate diagram: α (Dirichlet) → θm (document-specific topic distribution, optimized) → zmn (topic assignment, observed) → xmn (observed word) ← φk (topic, optimized) ← β (Dirichlet)]

Learning like this would be easy, but in practice we do not observe the topic assignments zmn.

SLIDE 48

Unsupervised Learning

Three learning paradigms:

1. Maximum likelihood estimation (MLE): θ̂ = argmax_θ p(X | θ)
2. Maximum a posteriori (MAP) estimation: θ̂ = argmax_θ p(θ | X) ∝ p(X | θ) p(θ)
3. Bayesian approach: estimate the posterior p(θ | X) = p(X | θ) p(θ) / p(X)
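To make the three paradigms concrete, here is a tiny worked example on the Beta-Bernoulli coin model from the earlier slides (a sketch; the Beta(2, 2) prior is an illustrative choice, not from the lecture).

```python
# Sketch: MLE vs. MAP vs. full posterior for the Beta-Bernoulli coin.
flips = "HTTHHTTHHH"                 # example corpus: 6 heads, 4 tails
n_h, n_t = flips.count("H"), flips.count("T")
a, b = 2.0, 2.0                      # Beta prior hyperparameters (illustrative)

mle = n_h / (n_h + n_t)                              # argmax_phi p(X|phi) = 0.6
map_est = (n_h + a - 1) / (n_h + n_t + a + b - 2)    # posterior mode = 7/12
posterior = (a + n_h, b + n_t)                       # phi | X ~ Beta(8, 6)
print(mle, map_est, posterior)
```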

SLIDE 49
LDA Inference

  • Standard EM (MLE)

[Plate diagram: θm (document-specific topic distribution, optimized) → zmn (topic assignment, exact inference) → xmn (observed word) ← φk (topic, optimized)]

SLIDE 50

LDA Inference

  • Standard EM (MAP Estimation)

[Plate diagram: α (Dirichlet) → θm (optimized) → zmn (exact inference) → xmn (observed word) ← φk (optimized) ← β (Dirichlet)]

SLIDE 51

LDA Inference

  • Monte Carlo EM (MAP Estimation)

[Plate diagram: α (Dirichlet) → θm (optimized) → zmn (sampled) → xmn (observed word) ← φk (optimized) ← β (Dirichlet)]

SLIDE 52

LDA Inference

  • Bayesian Approach

[Plate diagram: α (Dirichlet) → θm → zmn → xmn (observed word) ← φk ← β (Dirichlet); exact inference?]

SLIDE 53

Bayesian Inference

Whiteboard:

– Posteriors over parameters
– Bayesian inference for parameter estimation

SLIDE 54

LDA Inference

  • Bayesian Approach

[Plate diagram: α (Dirichlet) → θm → zmn → xmn (observed word) ← φk ← β (Dirichlet); exact inference? Intractable]

SLIDE 55

Exact Inference in LDA

  • Exactly computing the posterior is intractable in LDA.
    – Junction tree algorithm (exact inference in general graphical models):
        1. “Moralization” converts directed to undirected.
        2. “Triangulation” breaks 4-cycles by adding edges.
        3. Cliques are arranged into a junction tree.
    – Time complexity is exponential in the size of the cliques.
    – LDA cliques will be large (at least O(# topics)), so complexity is O(2^(# topics)).
  • Exact MAP inference in LDA is NP-hard for a large number of topics (Sontag & Roy, 2011).

SLIDE 56

LDA Inference

  • Explicit Gibbs Sampler

[Plate diagram: α (Dirichlet) → θm (sampled) → zmn (sampled) → xmn (observed word) ← φk (sampled) ← β (Dirichlet)]
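A minimal sketch of one sweep of the explicit sampler (assuming numpy; the names, sizes, and blocked update order are illustrative, not lecture code): θm and φk are kept as explicit variables and resampled from their Dirichlet full conditionals, alternating with the topic assignments.

```python
# Sketch: one sweep of the explicit (uncollapsed) Gibbs sampler for LDA.
# alpha (length K) and beta (length V) are Dirichlet hyperparameter vectors.
import numpy as np

def explicit_gibbs_sweep(docs, z, theta, phi, alpha, beta, rng):
    K, V = phi.shape
    # Resample phi_k | z, x ~ Dir(beta + topic-word counts)
    n_kw = np.zeros((K, V))
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_kw[z[d][i], w] += 1
    for k in range(K):
        phi[k] = rng.dirichlet(beta + n_kw[k])
    for d, doc in enumerate(docs):
        # Resample theta_d | z ~ Dir(alpha + doc-topic counts)
        theta[d] = rng.dirichlet(alpha + np.bincount(z[d], minlength=K))
        # Resample each z_dn | theta_d, phi, x proportional to theta_d[k] * phi[k, w]
        for i, w in enumerate(doc):
            p = theta[d] * phi[:, w]
            z[d][i] = int(rng.choice(K, p=p / p.sum()))
    return z, theta, phi
```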

SLIDE 57

LDA Inference

  • Collapsed Gibbs Sampler

[Plate diagram: α (Dirichlet) → θm (integrated out) → zmn (sampled) → xmn (observed word) ← φk (integrated out) ← β (Dirichlet)]
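With θm and φk integrated out, each zmn is resampled from its full conditional, which depends only on running count tables. The sketch below (assuming numpy, documents given as lists of integer word ids, and illustrative symmetric scalar hyperparameters) follows the standard collapsed update.

```python
# Sketch: collapsed Gibbs sampling for LDA with symmetric priors.
import numpy as np

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))    # doc-topic counts
    n_kw = np.zeros((K, V))            # topic-word counts
    n_k = np.zeros(K)                  # total tokens per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init of z_mn
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional with theta and phi integrated out
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```

After burn-in, point estimates can be read off the counts, e.g. a topic estimate proportional to n_kw[k, w] + β and a document-topic estimate proportional to n_dk[m, k] + α.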