Message Passing/Belief Propagation, CMSC 691 UMBC (PowerPoint PPT Presentation)

SLIDE 1

Message Passing/Belief Propagation

CMSC 691 UMBC

SLIDE 2

Markov Random Fields: Undirected Graphs

p(x_1, x_2, x_3, …, x_N) = (1/Z) ∏_C ψ_C(x_C)

The product runs over the maximal cliques C; Z is a global normalizer; ψ_C is the potential function (not necessarily a probability!) applied to the variables x_C in clique C.

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials ψ_C?
A: ψ_C ≥ 0 (or ψ_C > 0)

SLIDE 3

Terminology: Potential Functions

ψ_C(x_C) = exp(−E(x_C))

E is the energy function (for clique C); this exponential form is a Boltzmann distribution. (Get the total energy of a configuration by summing the individual energy functions.)

p(x_1, x_2, x_3, …, x_N) = (1/Z) ∏_C ψ_C(x_C)

SLIDE 4

MRFs as Factor Graphs

Undirected graph: G = (V, E) that represents p(X_1, …, X_N)
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions
Edges show which variables are used in which factors

SLIDE 5

MRFs as Factor Graphs

(Same definitions, illustrated on variables X, Y, Z drawn both as an undirected graph and as a factor graph.)

SLIDE 6

Outline

Message Passing: Graphical Model Inference
Example: Linear Chain CRF

SLIDE 7

Two Problems for Undirected Models

Finding the normalizer:
Z = Σ_x ∏_c ψ_c(x_c)

Computing the marginals:
Z_n(v) = Σ_{x : x_n = v} ∏_c ψ_c(x_c)

Sum over all variable combinations, with the x_n coordinate fixed.

Example: 3 variables, fix the 2nd dimension:
Z_2(v) = Σ_{x_1} Σ_{x_3} ∏_c ψ_c(x = (x_1, v, x_3))

p(x_1, x_2, x_3, …, x_N) = (1/Z) ∏_C ψ_C(x_C)

SLIDE 8

Two Problems for Undirected Models

Finding the normalizer:
Z = Σ_x ∏_c ψ_c(x_c)

Computing the marginals:
Z_n(v) = Σ_{x : x_n = v} ∏_c ψ_c(x_c)

Q: Why are these difficult?
A: Many different combinations: the sums range over every joint configuration of the variables.

SLIDE 9

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number 'one' to the soldier behind you. If you are the rearmost soldier in the line, say the number 'one' to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16


SLIDE 13

Sum-Product Algorithm

Main idea: message passing
An exact inference algorithm for tree-structured graphs
Belief propagation (forward-backward for HMMs) is a special case

SLIDE 14

Sum-Product

p(x_j = v) = Σ_{x : x_j = v} p(x_1, x_2, …, x_j, …, x_N)

(definition of marginal)

SLIDE 15

Sum-Product

p(x_j = v) = Σ_{x : x_j = v} p(x_1, x_2, …, x_j, …, x_N)

(definition of marginal)

Main idea: use the bipartite nature of the graph to efficiently compute the marginals.
The factor nodes can act as filters.

SLIDE 16

Sum-Product

p(x_j = v) = Σ_{x : x_j = v} p(x_1, x_2, …, x_j, …, x_N)

(definition of marginal)

Main idea: use the bipartite nature of the graph to efficiently compute the marginals, via messages r_{m1→n}, r_{m2→n}, r_{m3→n}:
r_{m→n} is a message from factor node m to evidence node n.

SLIDE 17

Sum-Product

p(x_j = v) = ∏_f r_{f→x_j}(x_j = v) / Σ_w ∏_f r_{f→x_j}(x_j = w) ∝ ∏_f r_{f→x_j}(x_j)

(alternative marginal computation)

Main idea: use the bipartite nature of the graph to efficiently compute the marginals, via messages r_{m1→n}, r_{m2→n}, r_{m3→n}:
r_{m→n} is a message from factor node m to evidence node n.

SLIDE 18

SLIDE 19

Sum-Product

r_{m1→n}, r_{m2→n}, r_{m3→n}: r_{m→n} is a message from factor node m to evidence node n.
q_{n→m1}, q_{n→m2}, q_{n→m3}: q_{n→m} is a message from evidence node n to factor node m.

SLIDE 20

Sum-Product

From variables to factors:
q_{n→m}(x_n) = …

n aggregates information from the rest of its graph via its neighbors.

SLIDE 21

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

M(n): set of factors in which variable n participates
(default value of 1 if empty product)

SLIDE 22

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

From factors to variables:
r_{m→n}(x_n) = …

M(n): set of factors in which variable n participates
(default value of 1 if empty product)
m aggregates information from the rest of its graph via its neighbors.

SLIDE 23

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

From factors to variables:
r_{m→n}(x_n) = …

M(n): set of factors in which variable n participates
(default value of 1 if empty product)
m aggregates information from the rest of its graph via its neighbors.
But these neighbors are R.V.s that take on different values.

SLIDE 24

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

From factors to variables:
r_{m→n}(x_n) = …
1. sum over configurations of variables for the m-th factor, with variable n fixed
2. aggregate the info those other variables provide about the rest of the graph

M(n): set of factors in which variable n participates
(default value of 1 if empty product)

SLIDE 25

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

From factors to variables:
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

1. sum over configurations of variables for the m-th factor, with variable n fixed
2. aggregate the info those other variables provide about the rest of the graph

M(n): set of factors in which variable n participates
(default value of 1 if empty product)

SLIDE 26

Sum-Product

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)

From factors to variables:
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

N(m): set of variables that the m-th factor depends on
M(n): set of factors in which variable n participates
The sum runs over configurations of the m-th factor's variables, with variable n fixed.
(default value of 1 if empty product)
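The variable-to-factor and factor-to-variable updates can be written directly as code; a minimal sketch over discrete variables, where the function names and the table encoding are my own:

```python
import itertools

def q_message(incoming_r, n_states):
    """Variable -> factor: pointwise product of the incoming
    factor->variable messages r_{m'->n}; empty product defaults to ones."""
    out = [1.0] * n_states
    for r in incoming_r:
        out = [o * ri for o, ri in zip(out, r)]
    return out

def r_message(table, clique, target, incoming_q, n_states):
    """Factor -> variable: sum over the factor's other variables of the
    factor value times the incoming variable->factor messages q_{n'->m}.
    table: dict mapping assignment tuples to factor values;
    clique: tuple of variable ids in the factor's scope;
    target: the variable the message is sent to;
    incoming_q: dict variable id -> message (target's entry unused)."""
    t_pos = clique.index(target)
    out = [0.0] * n_states
    for assign in itertools.product(range(n_states), repeat=len(clique)):
        w = table[assign]
        for var, val in zip(clique, assign):
            if var != target:
                w *= incoming_q[var][val]
        out[assign[t_pos]] += w
    return out
```

For a single pairwise factor whose other variable sends the all-ones message (a leaf), r_message reduces to column sums of the factor table.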

SLIDE 27

Meaning of the Computed Values

From variables to factors:
q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
x_n telling factor m the "goodness" for the rest of the graph if x_n has a particular value

From factors to variables:
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})
factor m telling x_n the "goodness" for the rest of the graph if x_n has a particular value

SLIDE 28

From Messages to Variable Beliefs

r_{m1→n}(x_n) tells x_n the "goodness" from m1's perspective if x_n has a particular value.
r_{m2→n}(x_n) tells x_n the "goodness" from m2's perspective if x_n has a particular value.

SLIDE 29

From Messages to Variable Beliefs

r_{m1→n}(x_n) tells x_n the "goodness" from m1's perspective if x_n has a particular value.
r_{m2→n}(x_n) tells x_n the "goodness" from m2's perspective if x_n has a particular value.

Together, they cover the entire graph!

SLIDE 30

From Messages to Variable Beliefs

r_{m1→n}(x_n) tells x_n the "goodness" from m1's perspective if x_n has a particular value.
r_{m2→n}(x_n) tells x_n the "goodness" from m2's perspective if x_n has a particular value.

Together, they cover the entire graph!

p(x_n = v) ∝ r_{m1→n}(x_n = v) · r_{m2→n}(x_n = v)

SLIDE 31

From Messages to Variable Beliefs: General Formula

r_{m1→n}(x_n) tells x_n the "goodness" from m1's perspective if x_n has a particular value; likewise r_{m2→n}(x_n) for m2.

p(x_n = v) ∝ ∏_{m ∈ M(n)} r_{m→n}(x_n = v)

SLIDE 32

From Messages to Factor Beliefs: General Formula

q_{n_j→m}(x_{n_j}) tells m the "goodness" from x_{n_j}'s perspective if it has a particular value.

p(x_m = v) ∝ f_m(x_m = v) ∏_{x_{n_j} ∈ N(m)} q_{n_j→m}(x_{n_j} = v_j)

where x_m denotes the variables of factor m and v = (v_1, v_2, …).

SLIDE 33

How to Use these Messages

1. Select the root (or pick one, if a tree)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat

SLIDE 34

How to Use these Messages

Compute Marginals/Normalizer
1. Select the root (or pick one, if a tree)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat

For Learning/Inference
Whenever you need to compute a likelihood, marginal probability, or a model-specific expectation, run this algorithm to compute the necessary probabilities:
– Prediction:
  • of a sequence: p(z_1, …, z_N | x_{1:N})
  • of an individual tag: p(z_j | x_{1:N})
– Marginal (if appropriate): p(x_{1:N})
– Learning model parameters: EM, variational inference, …
SLIDE 35

Example

(Factor graph: variables x_1, x_2, x_3, x_4 and factors f_a, f_b, f_c.)

Q: What are the variables?

SLIDE 36

Example

Q: What are the variables? A: x_1, x_2, x_3, x_4
Q: What are the factors?

SLIDE 37

Example

Q: What are the variables? A: x_1, x_2, x_3, x_4
Q: What are the factors? A: f_a(x_1, x_2), f_b(x_2, x_3), f_c(x_2, x_4)

SLIDE 38

Example

(Factor graph with factors f_a(x_1, x_2), f_b(x_2, x_3), f_c(x_2, x_4).)

Q: What is the distribution we're modeling?

SLIDE 39

Example

Q: What is the distribution we're modeling?
A: p(x_1, x_2, x_3, x_4) = f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4)

SLIDE 40

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 41

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1
      r_{f_a→x_2}(x_2) = ???

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 42

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1
      r_{f_a→x_2}(x_2) = Σ_k f_a(x_1 = k, x_2)
      r_{f_c→x_2}(x_2) = Σ_k f_c(x_2, x_4 = k)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 43

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1
      r_{f_a→x_2}(x_2) = Σ_k f_a(x_1 = k, x_2)
      r_{f_c→x_2}(x_2) = Σ_k f_c(x_2, x_4 = k)
      q_{x_2→f_b}(x_2) = ???

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 44

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1
      r_{f_a→x_2}(x_2) = Σ_k f_a(x_1 = k, x_2)
      r_{f_c→x_2}(x_2) = Σ_k f_c(x_2, x_4 = k)
      q_{x_2→f_b}(x_2) = r_{f_a→x_2}(x_2) · r_{f_c→x_2}(x_2)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 45

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root:
      q_{x_1→f_a}(x_1) = 1
      q_{x_4→f_c}(x_4) = 1
      r_{f_a→x_2}(x_2) = Σ_k f_a(x_1 = k, x_2)
      r_{f_c→x_2}(x_2) = Σ_k f_c(x_2, x_4 = k)
      q_{x_2→f_b}(x_2) = r_{f_a→x_2}(x_2) · r_{f_c→x_2}(x_2)
      r_{f_b→x_3}(x_3) = Σ_k f_b(x_2 = k, x_3) q_{x_2→f_b}(x_2 = k)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 46

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 47

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 48

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = ???

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 49

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

We just computed this (r_{f_b→x_2}). Q: Where did we compute this (r_{f_c→x_2})?

SLIDE 50

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

We just computed this (r_{f_b→x_2}). Q: Where did we compute this (r_{f_c→x_2})? A: In step a (leaves → root).

SLIDE 51

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)
      q_{x_2→f_c}(x_2) = r_{f_a→x_2}(x_2) · r_{f_b→x_2}(x_2)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 52

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)
      q_{x_2→f_c}(x_2) = r_{f_a→x_2}(x_2) · r_{f_b→x_2}(x_2)
      r_{f_c→x_4}(x_4) = Σ_k f_c(x_2 = k, x_4) q_{x_2→f_c}(x_2 = k)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 53

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves:
      q_{x_3→f_b}(x_3) = 1
      r_{f_b→x_2}(x_2) = Σ_k f_b(x_2, x_3 = k)
      q_{x_2→f_a}(x_2) = r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)
      q_{x_2→f_c}(x_2) = r_{f_a→x_2}(x_2) · r_{f_b→x_2}(x_2)
      r_{f_c→x_4}(x_4) = Σ_k f_c(x_2 = k, x_4) q_{x_2→f_c}(x_2 = k)
      r_{f_a→x_1}(x_1) = Σ_k f_a(x_1, x_2 = k) q_{x_2→f_a}(x_2 = k)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 54

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities:
      p(x_n) ∝ ∏_{m′ ∈ M(n)} r_{m′→n}(x_n)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 55

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities:
      p(x_n) ∝ ∏_{m′ ∈ M(n)} r_{m′→n}(x_n)
      p(x_1) ∝ r_{f_a→x_1}(x_1)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 56

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities:
      p(x_n) ∝ ∏_{m′ ∈ M(n)} r_{m′→n}(x_n)
      p(x_1) ∝ r_{f_a→x_1}(x_1)
      p(x_2) ∝ r_{f_a→x_2}(x_2) · r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 57

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities:
      p(x_n) ∝ ∏_{m′ ∈ M(n)} r_{m′→n}(x_n)
      p(x_1) ∝ r_{f_a→x_1}(x_1)
      p(x_2) ∝ r_{f_a→x_2}(x_2) · r_{f_b→x_2}(x_2) · r_{f_c→x_2}(x_2)
      p(x_3) ∝ r_{f_b→x_3}(x_3)
      p(x_4) ∝ r_{f_c→x_4}(x_4)

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 58

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 59

Example

1. Select the root, or pick one if a tree (x_3)
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 60

Example

1. Select the root, or pick one if a tree
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat [Loopy BP]

q_{n→m}(x_n) = ∏_{m′ ∈ M(n)\m} r_{m′→n}(x_n)
r_{m→n}(x_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

SLIDE 61

Max-Product (Max-Sum)

Problem: how to find the most likely (best) setting of the latent variables.
Replace the sum (+) with a max in the factor→variable computations:

r_{m→n}(x_n) = max_{x_{m\n}} f_m(x_m) ∏_{n′ ∈ N(m)\n} q_{n′→m}(x_{n′})

(Why "max-sum"? Computationally, implement with logs: products of potentials become sums of log-potentials, and max commutes with the monotone log.)
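A minimal numeric illustration of the swap, with a made-up pairwise table and the log-domain (max-sum) version alongside:

```python
import math

# Made-up pairwise factor over two binary variables.
f = {(0, 0): 0.5, (0, 1): 2.0, (1, 0): 1.5, (1, 1): 0.2}

# Max-product message from f to x2: replace the sum over x1 with a max.
r_max = [max(f[(k, v)] for k in (0, 1)) for v in (0, 1)]
best_score = max(r_max)  # best joint score of this single-factor graph

# Max-sum: the same computation in log space, so products become sums.
r_max_log = [max(math.log(f[(k, v)]) for k in (0, 1)) for v in (0, 1)]
```

Working in log space avoids underflow when many small potentials are multiplied, which is the practical reason for the "max-sum" name.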

SLIDE 62

Loopy Belief Propagation

The sum-product algorithm is not exact for general graphs.
Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best.
Requires a message passing schedule.

SLIDE 63

Outline

Message Passing: Graphical Model Inference
Example: Linear Chain CRF

SLIDE 64

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative): tags z1 z2 z3 z4 over words w1 w2 w3 w4
• Generate each tag, and generate each word from the tag
• Locally normalized

SLIDE 65

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative): tags z1 z2 z3 z4 over words w1 w2 w3 w4
Directed (e.g., maximum entropy Markov model [MEMM]; conditional): each word w_j predicts its tag z_j
• Given each word, generate (predict) each tag
• Locally normalized

SLIDE 66

Example: Linear Chain

Directed (HMM; generative) and directed (MEMM; conditional), as above
Undirected (e.g., conditional random field [CRF]): tags z1 z2 z3 z4, each connected to all words w1 w2 w3 w4
• Given all words, generate each tag
• Globally normalized

SLIDE 67

Example: Linear Chain

Undirected as factor graph (e.g., conditional random field [CRF]): z1 z2 z3 z4
• Given all words, generate each tag
• Globally normalized
SLIDE 68

Example: Linear Chain

Undirected as factor graph (e.g., conditional random field [CRF]): z1 z2 z3 z4
Q: What would the purple factors contain?

SLIDE 69

Example: Linear Chain

Q: What would the purple factors contain?
A: Tag-to-tag potential scores

SLIDE 70

Example: Linear Chain

Q: What would the purple factors contain? A: Tag-to-tag potential scores
Q: What would the green factors contain?

SLIDE 71

Example: Linear Chain

Q: What would the purple factors contain? A: Tag-to-tag potential scores
Q: What would the green factors contain? A: Sequence & tag potential scores

SLIDE 72

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging:

z1 z2 z3 z4
President Obama told Congress …
Noun-Mod Noun Noun Verb

SLIDE 73

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition:

z1 z2 z3 z4
President Obama told Congress …
Noun-Mod Noun Noun Verb

President Obama told Congress …
Person Person Org. Other

SLIDE 74

Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, …, z_N conditioned on the entire input sequence x_{1:N}

SLIDE 75

Linear Chain CRFs for Part of Speech Tagging

p(♣ | β™’)

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, …, z_N conditioned on the entire input sequence x_{1:N}

SLIDE 76

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | β™’)

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, …, z_N conditioned on the entire input sequence x_{1:N}

SLIDE 77

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N})

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, …, z_N conditioned on the entire input sequence x_{1:N}

SLIDE 78

Linear Chain CRFs for Part of Speech Tagging

(Factor graph: tags z1 z2 z3 z4 with solo-tag factors f_1, f_2, f_3, f_4 and inter-tag factors g_1, g_2, g_3, g_4.)

p(z_1, z_2, …, z_N | x_{1:N})

SLIDE 79

Linear Chain CRFs for Part of Speech Tagging

Q: What's the general formula for a factor graph/undirected PGM distribution?

SLIDE 80

Linear Chain CRFs for Part of Speech Tagging

Q: What's the general formula for a factor graph/undirected PGM distribution?
A: p(z_1, z_2, …, z_N) = (1/Z) ∏_C ψ_C(z_C)

SLIDE 81

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ product of exponentiated potential scores

p(z_1, z_2, …, z_N) = (1/Z) ∏_C ψ_C(z_C)

SLIDE 82

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ exp(−E_{g_1}(g_1)) ⋯ exp(−E_{g_N}(g_N)) · exp(−E_{f_1}(f_1)) ⋯ exp(−E_{f_N}(f_N))

SLIDE 83

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ exp(−E_{g_1}(g_1)) ⋯ exp(−E_{g_N}(g_N)) · exp(−E_{f_1}(f_1)) ⋯ exp(−E_{f_N}(f_N))

• We use the notation E_{g_j}(g_j) to separate the features g_j from how we reweight them
• We use −E_{g_j} to represent these as Boltzmann distributions

SLIDE 84

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ ∏_{j=1}^{N} exp(−E_{g_j}(g_j)) exp(−E_{f_j}(f_j))

SLIDE 85

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ ∏_{j=1}^{N} exp(−(E_{g_j}(g_j) + E_{f_j}(f_j)))

SLIDE 86

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ ∏_{j=1}^{N} exp(−(E_{g_j}(g_j) + E_{f_j}(f_j)))

Let E_{g_j}(g_j) = −⟨θ_g, g_j⟩ and E_{f_j}(f_j) = −⟨θ_f, f_j⟩,
where θ_f, θ_g are learnable parameters

SLIDE 87

Linear Chain CRFs for Part of Speech Tagging

p(z_1, z_2, …, z_N | x_{1:N}) ∝ ∏_{j=1}^{N} exp(⟨θ_f, f_j(z_j)⟩ + ⟨θ_g, g_j(z_j, z_{j+1})⟩)

SLIDE 88

Linear Chain CRFs for Part of Speech Tagging

g_k: inter-tag features (can depend on any/all input words x_{1:N})

SLIDE 89

Linear Chain CRFs for Part of Speech Tagging

g_k: inter-tag features (can depend on any/all input words x_{1:N})
f_j: solo tag features (can depend on any/all input words x_{1:N})

SLIDE 90

Linear Chain CRFs for Part of Speech Tagging

g_k: inter-tag features (can depend on any/all input words x_{1:N})
f_j: solo tag features (can depend on any/all input words x_{1:N})

Feature design, just like in maxent models!

SLIDE 91

Linear Chain CRFs for Part of Speech Tagging

g_k: inter-tag features (can depend on any/all input words x_{1:N})
f_j: solo tag features (can depend on any/all input words x_{1:N})

Example:
g_{k,N→V}(z_j, z_{j+1}) = 1 (if z_j == N & z_{j+1} == V) else 0
g_{k,told,N→V}(z_j, z_{j+1}) = 1 (if z_j == N & z_{j+1} == V & x_j == told) else 0

SLIDE 92

(For discussion/whiteboard)

  • How would we learn a CRF?
  • What objective would we optimize?
  • How would we use BP?
SLIDE 93

Key Insights (1)

  • Minimize (structured) cross-entropy loss ↔

(structured) maximum likelihood

  • Gradient has very familiar form of

β€œobserved feature counts – expected feature counts”

SLIDE 94

Key Insights (2)

  • Rely on adjacency connections/independence assumptions to compute

𝔼_{z′}[ Σ_j g_j(z′) ] = Σ_j Σ_{z_{j−1}, z_j} p(z_{j−1}, z_j | x_{1:N}) g_j(z_{j−1}, z_j)

SLIDE 95

Key Insights (3)

  • Run BP to compute beliefs (unnormalized, joint marginals):

p(z_{j−1}, z_j | x_{1:N}) ∝ g_{j−1}(z_{j−1}, z_j) · q_{z_{j−1}→g_{j−1}}(z_{j−1}) · q_{z_j→g_{j−1}}(z_j)

(Factor graph: words x1 x2 x3 x4 with solo-tag factors f_1, f_2, f_3, f_4 and inter-tag factors g_1, g_2, g_3, g_4.)

p(x_m = v) ∝ f_m(x_m = v) ∏_{x_{n_j} ∈ N(m)} q_{n_j→m}(x_{n_j} = v_j)

SLIDE 96

Outline

Message Passing: Graphical Model Inference
Example: Linear Chain CRF