Message Passing / Belief Propagation
CMSC 691, UMBC
Markov Random Fields: Undirected Graphs
$p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$

- $y_C$: the variables that are part of the clique $C$
- $\psi_C$: potential function (not necessarily a probability!)
- the product runs over the maximal cliques; $Z$ is the global normalizer
- clique: subset of nodes, where nodes are pairwise connected
- maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials $\psi_C$? A: $\psi_C \geq 0$ (or $\psi_C > 0$)
Terminology: Potential Functions
$\psi_C(y_C) = \exp(-E(y_C))$

- $E(y_C)$: energy function (for clique $C$); $\exp(-E)$ is a Boltzmann distribution
- (get the total energy of a configuration by summing the individual energy functions)

$p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$
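To make the Boltzmann construction concrete, here is a minimal Python sketch (the energy tables and names are made up for illustration, not from the slides). Energies can be any real numbers; exponentiating automatically yields potentials satisfying $\psi_C > 0$:

```python
import math, itertools

# Made-up clique energies over three binary variables y1, y2, y3.
E_AB = {(0, 0): -0.7, (0, 1): 0.0, (1, 0): 0.3, (1, 1): -1.1}  # clique {y1, y2}
E_BC = {(0, 0): 0.0, (0, 1): -1.4, (1, 0): 0.0, (1, 1): 0.2}   # clique {y2, y3}

def total_energy(y1, y2, y3):
    # Total energy of a configuration = sum of the individual clique energies.
    return E_AB[(y1, y2)] + E_BC[(y2, y3)]

def unnormalized(y1, y2, y3):
    # Product of Boltzmann potentials = exp(-total energy); always positive.
    return math.exp(-total_energy(y1, y2, y3))

# Global normalizer: sum the potential product over all configurations.
Z = sum(unnormalized(*y) for y in itertools.product([0, 1], repeat=3))
p = lambda *y: unnormalized(*y) / Z
print(sum(p(*y) for y in itertools.product([0, 1], repeat=3)))  # 1.0
```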
MRFs as Factor Graphs

- Undirected graph: $G = (V, E)$ that represents $p(X_1, \dots, X_N)$
- Factor graph of $p$: bipartite graph of evidence nodes $X$, factor nodes $F$, and edges $T$
  - Evidence nodes $X$ are the random variables
  - Factor nodes $F$ take values associated with the potential functions
  - Edges show which variables are used in which factors

(Figure: the same three variables X, Y, Z drawn as an undirected graph and as a factor graph.)
Outline
- Message Passing: Graphical Model Inference
- Example: Linear Chain CRF
Two Problems for Undirected Models

1. Finding the normalizer:
$Z = \sum_y \prod_m \psi_m(y_m)$

2. Computing the marginals (sum over all variable combinations, with the $n$-th coordinate fixed):
$p_n(v) = \sum_{y : y_n = v} \prod_m \psi_m(y_m)$

Example: 3 variables, fix the 2nd dimension:
$p_2(v) = \sum_{y_1} \sum_{y_3} \prod_m \psi_m(y = (y_1, v, y_3))$

Q: Why are these difficult? A: There are exponentially many variable combinations to sum over.

(Recall: $p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$)
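A brute-force sketch makes the difficulty visible: both sums below enumerate every configuration, so the work grows as $K^N$ for $N$ variables with $K$ values each. The toy factors here are illustrative:

```python
import itertools

# Toy chain MRF: three binary variables, two pairwise factors.
psi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0}

def product_of_factors(y1, y2, y3):
    return psi_12[(y1, y2)] * psi_23[(y2, y3)]

K, N = 2, 3
# Normalizer: a sum over all K**N joint configurations.
Z = sum(product_of_factors(*y) for y in itertools.product(range(K), repeat=N))

# Marginal p_2(v): fix the 2nd coordinate, sum over the K**(N-1) others.
def p2(v):
    return sum(product_of_factors(y1, v, y3)
               for y1 in range(K) for y3 in range(K)) / Z

print(Z, [p2(v) for v in range(K)])  # the marginals sum to 1
```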
Message Passing: Count the Soldiers

- If you are the front soldier in the line, say the number "one" to the soldier behind you.
- If you are the rearmost soldier in the line, say the number "one" to the soldier in front of you.
- If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch. 16
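The counting rule is itself a message-passing algorithm: each soldier's outgoing number summarizes everyone on one side of the line. A small sketch of the idea (the line length and message containers are illustrative):

```python
# Count soldiers in a line by passing partial counts in both directions.
n = 6  # soldiers indexed 0..n-1

# forward[i]: number said by soldier i toward the rear; counts soldiers 0..i.
# backward[i]: number said by soldier i toward the front; counts soldiers i..n-1.
forward, backward = [0] * n, [0] * n
forward[0] = 1           # front soldier says "one"
backward[n - 1] = 1      # rearmost soldier says "one"
for i in range(1, n):
    forward[i] = forward[i - 1] + 1           # add one, pass it on
    backward[n - 1 - i] = backward[n - i] + 1

# Any soldier can now compute the total without a global view:
# (count up to and including self) + (count from self onward) - self.
i = 3
print(forward[i] + backward[i] - 1)  # 6, regardless of which soldier computes it
```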
Sum-Product Algorithm
- Main idea: message passing
- An exact inference algorithm for tree-like graphs
- Belief propagation (forward-backward for HMMs) is a special case
Sum-Product

Definition of marginal:
$p(y_n = v) = \sum_{y : y_n = v} p(y_1, y_2, \dots, y_n, \dots, y_N)$

Main idea: use the bipartite nature of the graph to efficiently compute the marginals. The factor nodes can act as filters.

(Figure: variable node $n$ receives messages $\mu_{m_1 \to n}$, $\mu_{m_2 \to n}$, $\mu_{m_3 \to n}$ from its neighboring factor nodes.)

$\mu_{m \to n}$ is a message from factor node $m$ to evidence node $n$.
Sum-Product

Alternative marginal computation:
$p(y_n = v) = \frac{\prod_m \mu_{m \to n}(y_n = v)}{\sum_x \prod_m \mu_{m \to n}(y_n = x)} \propto \prod_m \mu_{m \to n}(y_n)$
Sum-Product

(Figure: the same variable node $n$, now also sending messages $\nu_{n \to m_1}$, $\nu_{n \to m_2}$, $\nu_{n \to m_3}$ to its neighboring factor nodes.)

$\mu_{m \to n}$ is a message from factor node $m$ to evidence node $n$.
$\nu_{n \to m}$ is a message from evidence node $n$ to factor node $m$.
Sum-Product

From variables to factors:
$\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$

- $M(n)$: set of factors in which variable $n$ participates
- default value of 1 if empty product
- $n$ aggregates information from the rest of its graph via its neighbors

From factors to variables:
$\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$

- $N(m)$: set of variables that the $m$-th factor depends on
- $m$ aggregates information from the rest of its graph via its neighbors; but these neighbors are R.V.s that take on different values, so:
  1. sum over configurations of variables for the $m$-th factor, with variable $n$ fixed
  2. aggregate the info those other variables provide about the rest of the graph
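The two updates translate almost line-for-line into code. Below is a minimal sketch for discrete variables, assuming a factor graph stored as plain dictionaries (`domains`, `scope`, `factors`, and the function names are illustrative, not a library API):

```python
import itertools

# Toy factor graph: variable domains, and one factor A over (y1, y2).
domains = {"y1": [0, 1], "y2": [0, 1]}
scope = {"A": ["y1", "y2"]}
factors = {"A": {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}}

def var_to_factor(n, m, mu, M_of_n):
    # nu_{n->m}(y_n) = product over the OTHER factors m' of mu_{m'->n}(y_n);
    # the empty product defaults to 1 (leaf variables).
    msg = {}
    for v in domains[n]:
        prod = 1.0
        for m2 in M_of_n[n]:
            if m2 != m:
                prod *= mu[(m2, n)][v]
        msg[v] = prod
    return msg

def factor_to_var(m, n, nu):
    # mu_{m->n}(y_n) = sum over the factor's other variables of
    # f_m(y_m) * prod of nu_{n'->m}(y_{n'}), with y_n held fixed.
    others = [v for v in scope[m] if v != n]
    msg = {}
    for v in domains[n]:
        total = 0.0
        for assignment in itertools.product(*(domains[o] for o in others)):
            full = dict(zip(others, assignment)); full[n] = v
            term = factors[m][tuple(full[s] for s in scope[m])]
            for o, val in zip(others, assignment):
                term *= nu[(o, m)][val]
            total += term
        msg[v] = total
    return msg

# y1 participates only in A, so nu_{y1->A} is the empty product (all 1s).
nu = {("y1", "A"): var_to_factor("y1", "A", {}, {"y1": ["A"]})}
print(factor_to_var("A", "y2", nu))  # {0: 3.0, 1: 4.0}: column sums of f_A
```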
Meaning of the Computed Values

From variables to factors: $\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$ is $y_n$ telling factor $m$ the "goodness" for the rest of the graph if $y_n$ has a particular value.

From factors to variables: $\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$ is factor $m$ telling $y_n$ the "goodness" for the rest of the graph if $y_n$ has a particular value.
From Messages to Variable Beliefs

(Figure: variable node $n$ with two neighboring factor nodes $m_1$ and $m_2$.)

- $\mu_{m_1 \to n}(y_n)$ tells $y_n$ the "goodness" from $m_1$'s perspective if $y_n$ has a particular value.
- $\mu_{m_2 \to n}(y_n)$ tells $y_n$ the "goodness" from $m_2$'s perspective if $y_n$ has a particular value.

Together, they cover the entire graph!

$p(y_n = v) \propto \mu_{m_1 \to n}(y_n = v) \, \mu_{m_2 \to n}(y_n = v)$
From Messages to Variable Beliefs: General Formula

$p(y_n = v) \propto \prod_{m \in M(y_n)} \mu_{m \to n}(y_n = v)$
From Messages to Factor Beliefs: General Formula

(Figure: factor node $m$ with neighboring variable nodes $n_1$, $n_2$, $n_3$.)

$\nu_{n_i \to m}$ tells $m$ the "goodness" from $y_{n_i}$'s perspective if it has a particular value.

$p(y_{N(m)} = v) \propto f_m(y_{N(m)} = v) \prod_{n_i \in N(m)} \nu_{n_i \to m}(y_{n_i} = v_i)$
How to Use these Messages

Compute Marginals/Normalizer:
1. Select the root, or pick one if a tree
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat

For Learning/Inference: whenever you need to compute a likelihood, marginal probability, or a model-specific expectation, run this algorithm to compute the necessary probabilities.
- Prediction:
  - of a sequence: $p(z_1, \dots, z_N \mid x_{1:N})$
  - of an individual tag: $p(z_i \mid x_{1:N})$
- Marginal (if appropriate): $p(x_{1:N})$
- Learning model parameters: EM, variational inference, …
Example

(Figure: factor graph with variable nodes $y_1, y_2, y_3, y_4$ and factor nodes $A$, $B$, $C$: $A$ connects $y_1$ and $y_2$, $B$ connects $y_2$ and $y_3$, $C$ connects $y_2$ and $y_4$.)

Q: What are the variables? A: $y_1, y_2, y_3, y_4$
Q: What are the factors? A: $f_A(y_1, y_2)$, $f_B(y_2, y_3)$, $f_C(y_2, y_4)$
Example

Q: What is the distribution we're modeling?
A: $p(y_1, y_2, y_3, y_4) = \frac{1}{Z} f_A(y_1, y_2) \, f_B(y_2, y_3) \, f_C(y_2, y_4)$
Example

1. Select the root, or pick one if a tree ($y_3$)
   a) Send messages from leaves to root:

$\nu_{y_1 \to A}(y_1) = 1$
$\nu_{y_4 \to C}(y_4) = 1$
$\mu_{A \to y_2}(y_2) = \sum_a f_A(y_1 = a, y_2)$
$\mu_{C \to y_2}(y_2) = \sum_c f_C(y_2, y_4 = c)$
$\nu_{y_2 \to B}(y_2) = \mu_{A \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$
$\mu_{B \to y_3}(y_3) = \sum_b \nu_{y_2 \to B}(y_2 = b) \, f_B(y_2 = b, y_3)$

(General updates: $\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$; $\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$)
Example

   b) Send messages from root to leaves:

$\nu_{y_3 \to B}(y_3) = 1$
$\mu_{B \to y_2}(y_2) = \sum_b f_B(y_2, y_3 = b)$
$\nu_{y_2 \to A}(y_2) = \mu_{B \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$

(We just computed $\mu_{B \to y_2}$. Q: Where did we compute $\mu_{C \to y_2}$? A: In step (a), leaves → root.)

$\nu_{y_2 \to C}(y_2) = \mu_{A \to y_2}(y_2) \, \mu_{B \to y_2}(y_2)$
$\mu_{C \to y_4}(y_4) = \sum_c \nu_{y_2 \to C}(y_2 = c) \, f_C(y_2 = c, y_4)$
$\mu_{A \to y_1}(y_1) = \sum_a \nu_{y_2 \to A}(y_2 = a) \, f_A(y_1, y_2 = a)$
Example

   c) Use messages to compute (unnormalized) marginal probabilities:

$p(y_n) \propto \prod_{m \in M(n)} \mu_{m \to n}(y_n)$

$p(y_1) \propto \mu_{A \to y_1}(y_1)$
$p(y_2) \propto \mu_{A \to y_2}(y_2) \, \mu_{B \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$
$p(y_3) \propto \mu_{B \to y_3}(y_3)$
$p(y_4) \propto \mu_{C \to y_4}(y_4)$
Example

2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat [Loopy BP]
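To check the schedule end-to-end, here is a sketch that runs it on the four-variable tree with concrete, made-up potential tables for $f_A$, $f_B$, $f_C$, and compares the resulting marginal of $y_2$ against brute-force enumeration:

```python
import itertools

# Made-up pairwise potentials over binary variables.
pairs = list(itertools.product([0, 1], [0, 1]))
fA = dict(zip(pairs, [2.0, 1.0, 1.0, 3.0]))
fB = dict(zip(pairs, [1.0, 4.0, 2.0, 1.0]))
fC = dict(zip(pairs, [1.0, 1.0, 3.0, 1.0]))
vals = [0, 1]

# Leaves to root (root = y3); nu_{y1->A} = nu_{y4->C} = 1.
mu_A_y2 = {y2: sum(fA[(a, y2)] for a in vals) for y2 in vals}
mu_C_y2 = {y2: sum(fC[(y2, c)] for c in vals) for y2 in vals}
nu_y2_B = {y2: mu_A_y2[y2] * mu_C_y2[y2] for y2 in vals}
mu_B_y3 = {y3: sum(nu_y2_B[b] * fB[(b, y3)] for b in vals) for y3 in vals}

# Root to leaves; nu_{y3->B} = 1.
mu_B_y2 = {y2: sum(fB[(y2, b)] for b in vals) for y2 in vals}

# Marginal of y2: product of ALL incoming factor messages, then normalize.
belief_y2 = {v: mu_A_y2[v] * mu_B_y2[v] * mu_C_y2[v] for v in vals}
Zb = sum(belief_y2.values())
bp_marginal = {v: belief_y2[v] / Zb for v in vals}

# Brute force for comparison.
def joint(y1, y2, y3, y4):
    return fA[(y1, y2)] * fB[(y2, y3)] * fC[(y2, y4)]
Z = sum(joint(*y) for y in itertools.product(vals, repeat=4))
brute = {v: sum(joint(y1, v, y3, y4) for y1 in vals for y3 in vals
                for y4 in vals) / Z for v in vals}

print(bp_marginal, brute)  # the two marginals agree
```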
Max-Product (Max-Sum)

Problem: how to find the most likely (best) setting of the latent variables.

Replace sum (+) with max in the factor → variable computations:

$\mu_{m \to n}(y_n) = \max_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$

(Why "max-sum"? Computationally, implement with logs: the products of potentials become sums of log-potentials.)
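In log space the products indeed become sums, which is where the name max-sum comes from. A sketch of one factor-to-variable update on a two-variable factor (the tables and incoming message are illustrative), keeping an argmax "backpointer" so the best configuration can be read off:

```python
import math

# Potentials for one pairwise factor over binary y2, y3, moved to log space.
fB = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}
logfB = {k: math.log(v) for k, v in fB.items()}

# Incoming log-message nu_{y2->B} (e.g., from elsewhere in the tree).
log_nu_y2 = {0: math.log(3.0), 1: math.log(2.0)}

# Max-sum update: replace sum with max, and product with +.
log_mu_B_y3, back = {}, {}
for y3 in (0, 1):
    scores = {y2: log_nu_y2[y2] + logfB[(y2, y3)] for y2 in (0, 1)}
    back[y3] = max(scores, key=scores.get)   # best y2 for this y3
    log_mu_B_y3[y3] = scores[back[y3]]

best_y3 = max(log_mu_B_y3, key=log_mu_B_y3.get)
print(best_y3, back[best_y3])  # jointly best (y3, y2) under this subtree
```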
Loopy Belief Propagation

- The sum-product algorithm is not exact for general graphs.
- Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best.
- Requires a message passing schedule.
Outline
- Message Passing: Graphical Model Inference
- Example: Linear Chain CRF
Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
(Figure: a chain of tags $z_1, z_2, z_3, z_4$, each emitting a word $w_1, w_2, w_3, w_4$.)
- Generate each tag, and generate each word from the tag
- Locally normalized

Directed (e.g., maximum entropy Markov model [MEMM]; conditional):
(Figure: the same chain, with the words feeding into the tags.)
- Given each word, generate (predict) each tag
- Locally normalized

Undirected, drawn as a factor graph (e.g., conditional random field [CRF]):
(Figure: tags $z_1, \dots, z_4$ connected by tag-tag factors, with each tag also connected to a factor over all the words $w_1 w_2 w_3 w_4$.)
- Given all words, generate each tag
- Globally normalized

Q: What would the purple factors contain? A: Tag-to-tag potential scores.
Q: What would the green factors contain? A: Sequence & tag potential scores.
Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition:

POS tagging: President/Noun-Mod Obama/Noun told/Verb Congress/Noun …
NER: President/Person Obama/Person told/Other Congress/Org. …
Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$:

$p(z_1, z_2, \dots, z_N \mid x_{1:N})$
Linear Chain CRFs for Part of Speech Tagging

(Figure: factor graph with tag variables $z_1, \dots, z_4$, inter-tag factors $\psi_1, \dots, \psi_4$, and solo-tag factors $\phi_1, \dots, \phi_4$.)

Q: What's the general formula for a factor graph/undirected PGM distribution?
A: $p(z_1, z_2, \dots, z_N) = \frac{1}{Z} \prod_C \psi_C(z_C)$

Here, $p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto$ product of exponentiated potential scores.
Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \exp(-E_{\psi_1}(\psi_1)) \cdots \exp(-E_{\psi_N}(\psi_N)) \cdot \exp(-E_{\phi_1}(\phi_1)) \cdots \exp(-E_{\phi_N}(\phi_N))$

- We use the notation $E_{\psi_n}(\psi_n)$ to separate the features $\psi_n$ from how we reweight them
- We use $-E_{\psi_n}$ to represent these as Boltzmann distributions
Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{n=1}^{N} \exp(-E_{\psi_n}(\psi_n)) \exp(-E_{\phi_n}(\phi_n)) = \prod_{n=1}^{N} \exp(-[E_{\psi_n}(\psi_n) + E_{\phi_n}(\phi_n)])$

Let $E_{\psi_n}(\psi_n) = -\langle \theta_\psi, \psi_n \rangle$ and $E_{\phi_n}(\phi_n) = -\langle \theta_\phi, \phi_n \rangle$, where $\theta_\psi, \theta_\phi$ are learnable parameters. Then:

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{i=1}^{N} \exp(\langle \theta_\phi, \phi_i(z_i) \rangle + \langle \theta_\psi, \psi_i(z_i, z_{i+1}) \rangle)$
Linear Chain CRFs for Part of Speech Tagging

- $\psi_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
- $\phi_i$: solo tag features (can depend on any/all input words $x_{1:N}$)

Feature design, just like in maxent models! Example:

$\psi_{i, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = N$ and $z_{j+1} = V$, else $0$
$\psi_{i, \text{told}, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = N$ and $z_{j+1} = V$ and $x_j = \text{told}$, else $0$
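Putting the pieces together, here is a sketch of the unnormalized log-score a linear-chain CRF assigns to one tag sequence, with hand-written indicator features in the spirit of the examples above (the feature functions, weights, and tiny tag set are all illustrative):

```python
# Unnormalized linear-chain CRF log-score: sum of solo-tag (phi) and
# inter-tag (psi) feature scores; exponentiating gives the potential product.
words = ["President", "Obama", "told", "Congress"]
tags  = ["N", "N", "V", "N"]

def phi_features(i, z_i, x):
    # Solo tag features: may look at any/all words in x.
    return {f"word={x[i]},tag={z_i}": 1.0}

def psi_features(i, z_i, z_next, x):
    # Inter-tag features: may also look at any/all words in x.
    feats = {f"trans={z_i}->{z_next}": 1.0}
    if x[i] == "told" and z_i == "N" and z_next == "V":
        feats["told,N->V"] = 1.0   # the slide's lexicalized example feature
    return feats

theta = {  # learnable parameters; made-up values here
    "word=told,tag=V": 2.0, "trans=N->V": 0.5, "told,N->V": 1.5,
}

def log_score(z, x):
    s = 0.0
    for i in range(len(x)):
        for f, v in phi_features(i, z[i], x).items():
            s += theta.get(f, 0.0) * v
        if i + 1 < len(x):
            for f, v in psi_features(i, z[i], z[i + 1], x).items():
                s += theta.get(f, 0.0) * v
    return s  # log of the unnormalized probability; Z_x needs sum-product

print(log_score(tags, words))  # 2.5 with these weights
```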
(For discussion/whiteboard)
- How would we learn a CRF?
- What objective would we optimize?
- How would we use BP?
Key Insights (1)

- Minimizing the (structured) cross-entropy loss ⇔ (structured) maximum likelihood
- The gradient has the very familiar form of "observed feature counts − expected feature counts"
Key Insights (2)

- Rely on adjacency connections/independence assumptions to compute

$\mathbb{E}_{z'}\left[\sum_k h_k(z')\right] = \sum_k \sum_{z_{k-1}, z_k} p(z_{k-1}, z_k \mid x_{1:N}) \, h_k(z_{k-1}, z_k)$
Key Insights (3)

- Run BP to compute beliefs (unnormalized, joint marginals):

$p(z_{k-1}, z_k \mid x_{1:N}) \propto \psi_{k-1}(z_{k-1}, z_k) \cdot \nu_{z_{k-1} \to \psi_{k-1}}(z_{k-1}) \cdot \nu_{z_k \to \psi_{k-1}}(z_k)$

(Figure: the linear-chain CRF factor graph again, with tag variables, inter-tag factors $\psi_1, \dots, \psi_4$, and solo-tag factors $\phi_1, \dots, \phi_4$.)

Recall the factor-belief formula:
$p(y_{N(m)} = v) \propto f_m(y_{N(m)} = v) \prod_{n_i \in N(m)} \nu_{n_i \to m}(y_{n_i} = v_i)$
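A sketch of this insight on a tiny chain: compute the variable-to-factor messages around one transition factor and combine them into a normalized pairwise belief. It assumes a three-tag chain with two transition factors, and made-up emission scores already summarized as per-position unary messages; all names and tables are illustrative:

```python
import itertools

T = ["N", "V"]  # tiny tag set

# Made-up transition potentials psi(z_k, z_{k+1}) and per-position unary
# scores u_k(z_k) (the solo-tag factors phi_k, already summarized).
psi = {(a, b): 1.0 + (a == "N") + 2.0 * (a == "N" and b == "V")
       for a in T for b in T}
u = [{"N": 2.0, "V": 1.0}, {"N": 1.0, "V": 3.0}, {"N": 1.0, "V": 1.0}]

# Message from the second transition factor back into z2:
# mu_{psi2->z2}(z2) = sum over z3 of psi(z2, z3) * u3(z3).
mu_psi2_z2 = {b: sum(psi[(b, c)] * u[2][c] for c in T) for b in T}

# Variable-to-factor messages into the first transition factor psi1.
nu_z1 = {a: u[0][a] for a in T}
nu_z2 = {b: u[1][b] * mu_psi2_z2[b] for b in T}

# Pairwise belief: b(z1, z2) ∝ psi(z1, z2) * nu_z1(z1) * nu_z2(z2).
belief = {(a, b): psi[(a, b)] * nu_z1[a] * nu_z2[b]
          for a, b in itertools.product(T, T)}
Z = sum(belief.values())
print({k: v / Z for k, v in belief.items()})  # p(z1, z2 | x), exact on a chain
```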