Bayesian Inference for Parameter Estimation + Topic Modeling


SLIDE 1

Bayesian Inference for Parameter Estimation + Topic Modeling

10-418 / 10-618 Machine Learning for Structured Data

Matt Gormley, Lecture 20, Nov. 4, 2019

Machine Learning Department, School of Computer Science, Carnegie Mellon University

SLIDE 2

Reminders

  • Homework 3: Structured SVM

– Out: Fri, Oct. 24
– Due: Wed, Nov. 6 at 11:59pm

  • Homework 4: Topic Modeling

– Out: Wed, Nov. 6
– Due: Mon, Nov. 18 at 11:59pm

SLIDE 3

TOPIC MODELING

SLIDE 4

Topic Modeling

Motivation: Suppose you’re given a massive corpus and asked to carry out the following tasks:

  • Organize the documents into thematic categories
  • Describe the evolution of those categories over time
  • Enable a domain expert to analyze and understand the content
  • Find relationships between the categories
  • Understand how authorship influences the content
SLIDE 5

Topic Modeling

Motivation: Suppose you’re given a massive corpus and asked to carry out the following tasks:

  • Organize the documents into thematic categories
  • Describe the evolution of those categories over time
  • Enable a domain expert to analyze and understand the content
  • Find relationships between the categories
  • Understand how authorship influences the content

Topic Modeling: A method of (usually unsupervised) discovery of latent or hidden structure in a corpus

  • Applied primarily to text corpora, but techniques are more general
  • Provides a modeling toolbox
  • Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets

SLIDE 6

Topic Modeling

http://www.cs.umass.edu/~mimno/icml100.html

Dirichlet-multinomial regression (DMR) topic model on ICML (Mimno & McCallum, 2008)

SLIDE 7

Topic Modeling

  • Map of NIH Grants

https://app.nihmaps.org/

(Talley et al., 2011)

SLIDE 8

Other Applications of Topic Models

  • Spatial LDA (SLDA)

(Wang & Grimson, 2007)

[Figure comparing labelings: Manual, LDA, SLDA]

SLIDE 9

Outline

  • Applications of Topic Modeling
  • Latent Dirichlet Allocation (LDA)
    1. Beta-Bernoulli
    2. Dirichlet-Multinomial
    3. Dirichlet-Multinomial Mixture Model
    4. LDA
  • Bayesian Inference for Parameter Estimation
    – Exact inference
    – EM
    – Monte Carlo EM
    – Gibbs sampler
    – Collapsed Gibbs sampler
  • Extensions of LDA
    – Correlated topic models
    – Dynamic topic models
    – Polylingual topic models
    – Supervised LDA

SLIDE 10

BAYESIAN INFERENCE FOR NAÏVE BAYES

SLIDE 11

Beta-Bernoulli Model

  • Beta Distribution

f(φ | α, β) = (1 / B(α, β)) φ^(α−1) (1 − φ)^(β−1)

[Figure: Beta densities f(φ | α, β) for (α, β) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)]

SLIDE 12

Beta-Bernoulli Model

  • Generative Process:

φ ∼ Beta(α, β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Bernoulli(φ) [draw word]

  • Example corpus (heads/tails):

H T T H H T T H H H
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
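Below is a minimal simulation of this generative process, offered as a sketch rather than lecture code; numpy, the seed, and the Beta(2, 2) hyperparameters are illustrative assumptions.

```python
# Sketch: simulate the Beta-Bernoulli generative process.
# Hyperparameters and seed are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, N = 2.0, 2.0, 10
phi = rng.beta(alpha, beta)           # phi ~ Beta(alpha, beta)
x = rng.binomial(1, phi, size=N)      # x_n ~ Bernoulli(phi), 1 = heads
print("".join("H" if v else "T" for v in x))  # a corpus like the one above
```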

SLIDE 13

Dirichlet-Multinomial Model


SLIDE 14

Dirichlet-Multinomial Model

  • Dirichlet Distribution:

p(φ | α) = (1 / B(α)) ∏_{k=1}^K φ_k^{α_k − 1}, where B(α) = ∏_{k=1}^K Γ(α_k) / Γ(∑_{k=1}^K α_k)

[Figure: two surface plots of the Dirichlet density p(φ | α) over the probability simplex]
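As a quick empirical companion to these density plots, the sketch below (assuming numpy; K and the α values are illustrative) draws from symmetric Dirichlets: α < 1 yields sparse draws near the corners of the simplex, while large α yields draws near the uniform distribution.

```python
# Sketch: how the concentration parameter shapes Dirichlet draws.
import numpy as np

rng = np.random.default_rng(0)
K = 5
for a in (0.1, 1.0, 10.0):
    sample = rng.dirichlet(np.full(K, a))   # one draw from Dir(a, ..., a)
    print(a, np.round(sample, 3))           # small a: spiky; large a: near-uniform
```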

SLIDE 15

Dirichlet-Multinomial Model

  • Generative Process:

φ ∼ Dir(β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Mult(1, φ) [draw word]

  • Example corpus:

the he is the and the she she is is
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

SLIDE 16

Dirichlet-Multinomial Mixture Model

  • Generative Process
  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)

[Diagram: topics, documents (mixture)]

SLIDE 17

Dirichlet-Multinomial Mixture Model

  • Generative Process:

For each topic k ∈ {1, . . . , K}:
    φk ∼ Dir(β) [draw distribution over words]
θ ∼ Dir(α) [draw distribution over topics]
For each document m ∈ {1, . . . , M}:
    zm ∼ Mult(1, θ) [draw topic assignment]
    For each word n ∈ {1, . . . , Nm}:
        xmn ∼ Mult(1, φzm) [draw word]

  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)
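The sketch below simulates this mixture story with toy sizes (a hypothetical setup, not lecture code); note that a single topic z_m is drawn once per document and shared by all of its words.

```python
# Sketch: Dirichlet-Multinomial mixture generative process (toy sizes).
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N_m = 10, 3, 4, 5                       # vocab, topics, docs, words/doc
alpha, beta = 1.0, 0.5
phi = rng.dirichlet(np.full(V, beta), size=K)    # phi_k ~ Dir(beta), one per topic
theta = rng.dirichlet(np.full(K, alpha))         # theta ~ Dir(alpha), corpus-wide
docs = []
for m in range(M):
    z_m = rng.choice(K, p=theta)                       # z_m ~ Mult(1, theta)
    docs.append(rng.choice(V, p=phi[z_m], size=N_m))   # x_mn ~ Mult(1, phi_{z_m})
```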

SLIDE 18

Bayesian Inference for Naïve Bayes

Whiteboard:

– Naïve Bayes is not Bayesian
– What if we observed both words and topics?
– Dirichlet-Multinomial in the fully observed setting is just Naïve Bayes
– Three ways of estimating parameters:
    1. MLE for Naïve Bayes
    2. MAP estimation for Naïve Bayes
    3. Bayesian parameter estimation for Naïve Bayes

SLIDE 19

Dirichlet-Multinomial Model

  • The Dirichlet is conjugate to the Multinomial.
  • The posterior of φ is p(φ|X) = p(X|φ) p(φ) / p(X).
  • Define the count vector n such that nt denotes the number of times word t appeared.
  • Then the posterior is also a Dirichlet distribution: p(φ|X) = Dir(β + n).

Recall the model:
φ ∼ Dir(β) [draw distribution over words]
For each word n ∈ {1, . . . , N}:
    xn ∼ Mult(1, φ) [draw word]
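In code, the conjugate update is just the count vector added to the prior; the sketch below uses the example corpus from the earlier slide with an illustrative symmetric Dir(1) prior.

```python
# Sketch: conjugate posterior update Dir(beta + n) for the
# Dirichlet-Multinomial model. The prior choice is illustrative.
import numpy as np

vocab = ["the", "he", "is", "and", "she"]
corpus = "the he is the and the she she is is".split()
beta = np.ones(len(vocab))                        # symmetric Dir(1) prior
n = np.array([corpus.count(w) for w in vocab])    # n_t = count of word t
posterior = beta + n                              # phi | X ~ Dir(beta + n)
print(posterior, posterior / posterior.sum())     # parameters and posterior mean
```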

SLIDE 20

LATENT DIRICHLET ALLOCATION (LDA)

SLIDE 21

Mixture vs. Admixture (LDA)

!"#$%& '"%()*+!&

,)$-!(.*/

!"#$%& '"%()*+!&

,0')$-!(.*/

Diagrams from Wallach, JHU 2011, slides

SLIDE 22

Latent Dirichlet Allocation

  • Generative Process
  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)

[Diagram: topics, documents (admixture)]

SLIDE 23

Latent Dirichlet Allocation

  • Generative Process:

For each topic k ∈ {1, . . . , K}:
    φk ∼ Dir(β) [draw distribution over words]
For each document m ∈ {1, . . . , M}:
    θm ∼ Dir(α) [draw distribution over topics]
    For each word n ∈ {1, . . . , Nm}:
        zmn ∼ Mult(1, θm) [draw topic assignment]
        xmn ∼ Mult(1, φzmn) [draw word]

  • Example corpus:

Document 1: the he is (x11 x12 x13)
Document 2: the and the (x21 x22 x23)
Document 3: she she is is (x31 x32 x33 x34)
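Here is the full LDA story as a simulation with toy sizes (a sketch under assumed hyperparameters, not the lecture's code). Compare with the mixture model above: the topic assignment now happens per word, not per document.

```python
# Sketch: LDA generative process (toy sizes; all values illustrative).
import numpy as np

rng = np.random.default_rng(0)
V, K, M = 10, 3, 5
alpha, beta = 0.5, 0.1
phi = rng.dirichlet(np.full(V, beta), size=K)      # phi_k ~ Dir(beta)
docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))     # theta_m ~ Dir(alpha), per doc
    N_m = rng.poisson(20)                          # doc length (illustrative)
    z = rng.choice(K, size=N_m, p=theta_m)         # z_mn ~ Mult(1, theta_m)
    x = [int(rng.choice(V, p=phi[k])) for k in z]  # x_mn ~ Mult(1, phi_{z_mn})
    docs.append(x)
```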

SLIDE 24

LDA for Topic Modeling

  • The generative story begins with only a Dirichlet prior over the topics.
  • Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk.

[Figure: topic-word distributions drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 25

LDA for Topic Modeling

  • The generative story begins with only a Dirichlet prior over the topics.
  • Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk.

[Figure: six topic-word distributions ϕ1 … ϕ6 drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 26

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β); one topic is shown as its top words: team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup]

(Blei, Ng, & Jordan, 2003)

SLIDE 27

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β); the topic with top words team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup is labeled {hockey}]

(Blei, Ng, & Jordan, 2003)

SLIDE 28

LDA for Topic Modeling

  • A topic is visualized as its high-probability words.
  • A pedagogical label is used to identify the topic.

[Figure: ϕ1 … ϕ6 drawn from Dirichlet(β), labeled {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 29

LDA for Topic Modeling

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β): {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 30

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 31

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 32

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 33

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…

θ1 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β)]

(Blei, Ng, & Jordan, 2003)

SLIDE 34

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

θ1, θ2, θ3 ∼ Dirichlet(α)

[Figure: the six labeled topics ϕ1 … ϕ6, each drawn from Dirichlet(β): {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}]

(Blei, Ng, & Jordan, 2003)

SLIDE 35

LDA for Topic Modeling

[Same three example documents as on the previous slide]

θ1, θ2, θ3 ∼ Dirichlet(α): distributions over topics (one per document)
ϕ1 … ϕ6 ∼ Dirichlet(β): distributions over words (one per topic), labeled {Canadian gov.} {government} {hockey} {U.S. gov.} {baseball} {Japan}

(Blei, Ng, & Jordan, 2003)

SLIDE 36

LDA for Topic Modeling

[Same three example documents, topic distributions θ1, θ2, θ3 ∼ Dirichlet(α), and labeled topics ϕ1 … ϕ6 ∼ Dirichlet(β) as above]

(Blei, Ng, & Jordan, 2003)

SLIDE 37

LDA for Topic Modeling

Inference and learning start with only the data:

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

[Figure: the θm, the ϕk, and both Dirichlet priors are now unobserved, shown as blanks]

(Blei, Ng, & Jordan, 2003)

SLIDE 38

Plate Diagrams

Whiteboard:

– Example #1: Plate diagram for Dirichlet-Multinomial model
– Example #2: Plate diagram for LDA

SLIDE 39

Latent Dirichlet Allocation

  • Plate Diagram

[Plate diagram: α → θm → zmn → xmn ← φk ← β, with plates over M documents, Nm words per document, and K topics]

SLIDE 40

Latent Dirichlet Allocation

  • Plate Diagram

[Plate diagram: α (Dirichlet prior) → θm (document-specific topic distribution) → zmn (topic assignment) → xmn (observed word) ← φk (topic) ← β (Dirichlet prior)]

SLIDE 41

Latent Dirichlet Allocation

Questions:

  • Is this a believable story for the generation of a corpus of documents?
  • Why might it work well anyway?
SLIDE 42

Latent Dirichlet Allocation

Why does LDA “work”?

  • LDA trades off two goals:
    1. For each document, allocate its words to as few topics as possible.
    2. For each topic, assign high probability to as few terms as possible.
  • These goals are at odds.
  • Putting a document in a single topic makes goal #2 hard: all of its words must have probability under that topic.
  • Putting very few words in each topic makes goal #1 hard: to cover a document’s words, it must assign many topics to it.
  • Trading off these goals finds groups of tightly co-occurring words.

Slide from David Blei, MLSS 2012

SLIDE 43

Latent Dirichlet Allocation

How does this relate to my other favorite models for capturing low-dimensional representations of a corpus?

  • LDA builds on latent semantic analysis (Deerwester et al., 1990; Hofmann, 1999).
  • It is a mixed-membership model (Erosheva, 2004).
  • It relates to PCA and matrix factorization (Jakulin and Buntine, 2002).
  • It was independently invented for genetics (Pritchard et al., 2000).

Slide from David Blei, MLSS 2012

SLIDE 44

Outline

  • Applications of Topic Modeling
  • Latent Dirichlet Allocation (LDA)
    1. Beta-Bernoulli
    2. Dirichlet-Multinomial
    3. Dirichlet-Multinomial Mixture Model
    4. LDA
  • Bayesian Inference for Parameter Estimation
    – Exact inference
    – EM
    – Monte Carlo EM
    – Gibbs sampler
    – Collapsed Gibbs sampler
  • Extensions of LDA
    – Correlated topic models
    – Dynamic topic models
    – Polylingual topic models
    – Supervised LDA

SLIDE 45

BAYESIAN INFERENCE FOR PARAMETER ESTIMATION

SLIDE 46
LDA Inference

  • Fully Observed MLE

[Plate diagram: θm (document-specific topic distribution, optimized) → zmn (topic assignment, observed) → xmn (observed word) ← φk (topic, optimized)]

Learning like this would be easy, but in practice we do not observe the topic assignments zmn.

SLIDE 47

LDA Inference

  • Fully Observed MAP Estimation

[Plate diagram: α (Dirichlet) → θm (document-specific topic distribution, optimized) → zmn (topic assignment, observed) → xmn (observed word) ← φk (topic, optimized) ← β (Dirichlet)]

Learning like this would be easy, but in practice we do not observe the topic assignments zmn.

SLIDE 48

Unsupervised Learning

Three learning paradigms:

1. Maximum likelihood estimation (MLE): θ̂ = argmax_θ p(X | θ)
2. Maximum a posteriori (MAP) estimation: θ̂ = argmax_θ p(θ | X) ∝ p(X | θ) p(θ)
3. Bayesian approach: estimate the posterior p(θ | X) = p(X | θ) p(θ) / p(X)
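To make the three paradigms concrete, here is a tiny worked example on the Beta-Bernoulli coin model from the earlier slides (a sketch; the Beta(2, 2) prior is an illustrative choice, not from the lecture).

```python
# Sketch: MLE vs. MAP vs. full posterior for the Beta-Bernoulli coin.
flips = "HTTHHTTHHH"                 # example corpus: 6 heads, 4 tails
n_h, n_t = flips.count("H"), flips.count("T")
a, b = 2.0, 2.0                      # Beta prior hyperparameters (illustrative)

mle = n_h / (n_h + n_t)                              # argmax_phi p(X|phi) = 0.6
map_est = (n_h + a - 1) / (n_h + n_t + a + b - 2)    # posterior mode = 7/12
posterior = (a + n_h, b + n_t)                       # phi | X ~ Beta(8, 6)
print(mle, map_est, posterior)
```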

SLIDE 49
LDA Inference

  • Standard EM (MLE)

[Plate diagram: θm (document-specific topic distribution, optimized) → zmn (topic assignment, exact inference) → xmn (observed word) ← φk (topic, optimized)]

SLIDE 50

LDA Inference

  • Standard EM (MAP Estimation)

[Plate diagram: α (Dirichlet) → θm (optimized) → zmn (exact inference) → xmn (observed word) ← φk (optimized) ← β (Dirichlet)]

SLIDE 51

LDA Inference

  • Monte Carlo EM (MAP Estimation)

[Plate diagram: α (Dirichlet) → θm (optimized) → zmn (sampled) → xmn (observed word) ← φk (optimized) ← β (Dirichlet)]

SLIDE 52

LDA Inference

  • Bayesian Approach

[Plate diagram: α (Dirichlet) → θm → zmn → xmn (observed word) ← φk ← β (Dirichlet); exact inference?]

SLIDE 53

Bayesian Inference

Whiteboard:

– Posteriors over parameters
– Bayesian inference for parameter estimation

SLIDE 54

LDA Inference

  • Bayesian Approach

[Plate diagram: α (Dirichlet) → θm → zmn → xmn (observed word) ← φk ← β (Dirichlet); exact inference? Intractable]

SLIDE 55

Exact Inference in LDA

  • Exactly computing the posterior is intractable in LDA.
    – Junction tree algorithm (exact inference in general graphical models):
        1. “Moralization” converts directed to undirected.
        2. “Triangulation” breaks 4-cycles by adding edges.
        3. Cliques are arranged into a junction tree.
    – Time complexity is exponential in the size of the cliques.
    – LDA cliques will be large (at least O(# topics)), so complexity is O(2^(# topics)).
  • Exact MAP inference in LDA is NP-hard for a large number of topics (Sontag & Roy, 2011).

SLIDE 56

LDA Inference

  • Explicit Gibbs Sampler

[Plate diagram: α (Dirichlet) → θm (sampled) → zmn (sampled) → xmn (observed word) ← φk (sampled) ← β (Dirichlet)]
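A minimal sketch of one sweep of the explicit sampler (assuming numpy; the names, sizes, and blocked update order are illustrative, not lecture code): θm and φk are kept as explicit variables and resampled from their Dirichlet full conditionals, alternating with the topic assignments.

```python
# Sketch: one sweep of the explicit (uncollapsed) Gibbs sampler for LDA.
# alpha (length K) and beta (length V) are Dirichlet hyperparameter vectors.
import numpy as np

def explicit_gibbs_sweep(docs, z, theta, phi, alpha, beta, rng):
    K, V = phi.shape
    # Resample phi_k | z, x ~ Dir(beta + topic-word counts)
    n_kw = np.zeros((K, V))
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_kw[z[d][i], w] += 1
    for k in range(K):
        phi[k] = rng.dirichlet(beta + n_kw[k])
    for d, doc in enumerate(docs):
        # Resample theta_d | z ~ Dir(alpha + doc-topic counts)
        theta[d] = rng.dirichlet(alpha + np.bincount(z[d], minlength=K))
        # Resample each z_dn | theta_d, phi, x proportional to theta_d[k] * phi[k, w]
        for i, w in enumerate(doc):
            p = theta[d] * phi[:, w]
            z[d][i] = int(rng.choice(K, p=p / p.sum()))
    return z, theta, phi
```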

SLIDE 57

LDA Inference

  • Collapsed Gibbs Sampler

[Plate diagram: α (Dirichlet) → θm (integrated out) → zmn (sampled) → xmn (observed word) ← φk (integrated out) ← β (Dirichlet)]
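With θm and φk integrated out, each zmn is resampled from its full conditional, which depends only on running count tables. The sketch below (assuming numpy, documents given as lists of integer word ids, and illustrative symmetric scalar hyperparameters) follows the standard collapsed update.

```python
# Sketch: collapsed Gibbs sampling for LDA with symmetric priors.
import numpy as np

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))    # doc-topic counts
    n_kw = np.zeros((K, V))            # topic-word counts
    n_k = np.zeros(K)                  # total tokens per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init of z_mn
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional with theta and phi integrated out
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```

After burn-in, point estimates can be read off the counts, e.g. a topic estimate proportional to n_kw[k, w] + β and a document-topic estimate proportional to n_dk[m, k] + α.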