Message Passing / Belief Propagation
CMSC 691, UMBC
Markov Random Fields: Undirected Graphs
$p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$

- $y_C$: the variables that are part of the clique $C$
- $\psi_C$: potential function (not necessarily a probability!)
- the product runs over the maximal cliques; $Z$ is the global normalizer
- clique: subset of nodes, where nodes are pairwise connected
- maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials $\psi_C$? A: $\psi_C \geq 0$ (or $\psi_C > 0$)
Terminology: Potential Functions
$\psi_C(y_C) = \exp(-E(y_C))$

- $E(y_C)$: energy function (for clique $C$); $\exp(-E)$ is a Boltzmann distribution
- (get the total energy of a configuration by summing the individual energy functions)

$p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$
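To make the Boltzmann construction concrete, here is a minimal Python sketch (the energy tables and names are made up for illustration, not from the slides). Energies can be any real numbers; exponentiating automatically yields potentials satisfying $\psi_C > 0$:

```python
import math, itertools

# Made-up clique energies over three binary variables y1, y2, y3.
E_AB = {(0, 0): -0.7, (0, 1): 0.0, (1, 0): 0.3, (1, 1): -1.1}  # clique {y1, y2}
E_BC = {(0, 0): 0.0, (0, 1): -1.4, (1, 0): 0.0, (1, 1): 0.2}   # clique {y2, y3}

def total_energy(y1, y2, y3):
    # Total energy of a configuration = sum of the individual clique energies.
    return E_AB[(y1, y2)] + E_BC[(y2, y3)]

def unnormalized(y1, y2, y3):
    # Product of Boltzmann potentials = exp(-total energy); always positive.
    return math.exp(-total_energy(y1, y2, y3))

# Global normalizer: sum the potential product over all configurations.
Z = sum(unnormalized(*y) for y in itertools.product([0, 1], repeat=3))
p = lambda *y: unnormalized(*y) / Z
print(sum(p(*y) for y in itertools.product([0, 1], repeat=3)))  # 1.0
```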
MRFs as Factor Graphs

- Undirected graph: $G = (V, E)$ that represents $p(X_1, \dots, X_N)$
- Factor graph of $p$: bipartite graph of evidence nodes $X$, factor nodes $F$, and edges $T$
  - Evidence nodes $X$ are the random variables
  - Factor nodes $F$ take values associated with the potential functions
  - Edges show which variables are used in which factors

(Figure: the same three variables X, Y, Z drawn as an undirected graph and as a factor graph.)
Outline
- Message Passing: Graphical Model Inference
- Example: Linear Chain CRF
Two Problems for Undirected Models

1. Finding the normalizer:
$Z = \sum_y \prod_m \psi_m(y_m)$

2. Computing the marginals (sum over all variable combinations, with the $n$-th coordinate fixed):
$p_n(v) = \sum_{y : y_n = v} \prod_m \psi_m(y_m)$

Example: 3 variables, fix the 2nd dimension:
$p_2(v) = \sum_{y_1} \sum_{y_3} \prod_m \psi_m(y = (y_1, v, y_3))$

Q: Why are these difficult? A: There are exponentially many variable combinations to sum over.

(Recall: $p(y_1, y_2, y_3, \dots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$)
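A brute-force sketch makes the difficulty visible: both sums below enumerate every configuration, so the work grows as $K^N$ for $N$ variables with $K$ values each. The toy factors here are illustrative:

```python
import itertools

# Toy chain MRF: three binary variables, two pairwise factors.
psi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0}

def product_of_factors(y1, y2, y3):
    return psi_12[(y1, y2)] * psi_23[(y2, y3)]

K, N = 2, 3
# Normalizer: a sum over all K**N joint configurations.
Z = sum(product_of_factors(*y) for y in itertools.product(range(K), repeat=N))

# Marginal p_2(v): fix the 2nd coordinate, sum over the K**(N-1) others.
def p2(v):
    return sum(product_of_factors(y1, v, y3)
               for y1 in range(K) for y3 in range(K)) / Z

print(Z, [p2(v) for v in range(K)])  # the marginals sum to 1
```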
Message Passing: Count the Soldiers

- If you are the front soldier in the line, say the number "one" to the soldier behind you.
- If you are the rearmost soldier in the line, say the number "one" to the soldier in front of you.
- If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch. 16
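The counting rule is itself a message-passing algorithm: each soldier's outgoing number summarizes everyone on one side of the line. A small sketch of the idea (the line length and message containers are illustrative):

```python
# Count soldiers in a line by passing partial counts in both directions.
n = 6  # soldiers indexed 0..n-1

# forward[i]: number said by soldier i toward the rear; counts soldiers 0..i.
# backward[i]: number said by soldier i toward the front; counts soldiers i..n-1.
forward, backward = [0] * n, [0] * n
forward[0] = 1           # front soldier says "one"
backward[n - 1] = 1      # rearmost soldier says "one"
for i in range(1, n):
    forward[i] = forward[i - 1] + 1           # add one, pass it on
    backward[n - 1 - i] = backward[n - i] + 1

# Any soldier can now compute the total without a global view:
# (count up to and including self) + (count from self onward) - self.
i = 3
print(forward[i] + backward[i] - 1)  # 6, regardless of which soldier computes it
```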
Sum-Product Algorithm
- Main idea: message passing
- An exact inference algorithm for tree-like graphs
- Belief propagation (forward-backward for HMMs) is a special case
Sum-Product

Definition of marginal:
$p(y_n = v) = \sum_{y : y_n = v} p(y_1, y_2, \dots, y_n, \dots, y_N)$

Main idea: use the bipartite nature of the graph to efficiently compute the marginals. The factor nodes can act as filters.

(Figure: variable node $n$ receives messages $\mu_{m_1 \to n}$, $\mu_{m_2 \to n}$, $\mu_{m_3 \to n}$ from its neighboring factor nodes.)

$\mu_{m \to n}$ is a message from factor node $m$ to evidence node $n$.
Sum-Product

Alternative marginal computation:
$p(y_n = v) = \frac{\prod_m \mu_{m \to n}(y_n = v)}{\sum_x \prod_m \mu_{m \to n}(y_n = x)} \propto \prod_m \mu_{m \to n}(y_n)$
Sum-Product

(Figure: the same variable node $n$, now also sending messages $\nu_{n \to m_1}$, $\nu_{n \to m_2}$, $\nu_{n \to m_3}$ to its neighboring factor nodes.)

$\mu_{m \to n}$ is a message from factor node $m$ to evidence node $n$.
$\nu_{n \to m}$ is a message from evidence node $n$ to factor node $m$.
Sum-Product

From variables to factors:
$\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$

- $M(n)$: set of factors in which variable $n$ participates
- default value of 1 if empty product
- $n$ aggregates information from the rest of its graph via its neighbors

From factors to variables:
$\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$

- $N(m)$: set of variables that the $m$-th factor depends on
- $m$ aggregates information from the rest of its graph via its neighbors; but these neighbors are R.V.s that take on different values, so:
  1. sum over configurations of variables for the $m$-th factor, with variable $n$ fixed
  2. aggregate the info those other variables provide about the rest of the graph
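The two updates translate almost line-for-line into code. Below is a minimal sketch for discrete variables, assuming a factor graph stored as plain dictionaries (`domains`, `scope`, `factors`, and the function names are illustrative, not a library API):

```python
import itertools

# Toy factor graph: variable domains, and one factor A over (y1, y2).
domains = {"y1": [0, 1], "y2": [0, 1]}
scope = {"A": ["y1", "y2"]}
factors = {"A": {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}}

def var_to_factor(n, m, mu, M_of_n):
    # nu_{n->m}(y_n) = product over the OTHER factors m' of mu_{m'->n}(y_n);
    # the empty product defaults to 1 (leaf variables).
    msg = {}
    for v in domains[n]:
        prod = 1.0
        for m2 in M_of_n[n]:
            if m2 != m:
                prod *= mu[(m2, n)][v]
        msg[v] = prod
    return msg

def factor_to_var(m, n, nu):
    # mu_{m->n}(y_n) = sum over the factor's other variables of
    # f_m(y_m) * prod of nu_{n'->m}(y_{n'}), with y_n held fixed.
    others = [v for v in scope[m] if v != n]
    msg = {}
    for v in domains[n]:
        total = 0.0
        for assignment in itertools.product(*(domains[o] for o in others)):
            full = dict(zip(others, assignment)); full[n] = v
            term = factors[m][tuple(full[s] for s in scope[m])]
            for o, val in zip(others, assignment):
                term *= nu[(o, m)][val]
            total += term
        msg[v] = total
    return msg

# y1 participates only in A, so nu_{y1->A} is the empty product (all 1s).
nu = {("y1", "A"): var_to_factor("y1", "A", {}, {"y1": ["A"]})}
print(factor_to_var("A", "y2", nu))  # {0: 3.0, 1: 4.0}: column sums of f_A
```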
Meaning of the Computed Values

From variables to factors: $\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$ is $y_n$ telling factor $m$ the "goodness" for the rest of the graph if $y_n$ has a particular value.

From factors to variables: $\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$ is factor $m$ telling $y_n$ the "goodness" for the rest of the graph if $y_n$ has a particular value.
From Messages to Variable Beliefs

(Figure: variable node $n$ with two neighboring factor nodes $m_1$ and $m_2$.)

- $\mu_{m_1 \to n}(y_n)$ tells $y_n$ the "goodness" from $m_1$'s perspective if $y_n$ has a particular value.
- $\mu_{m_2 \to n}(y_n)$ tells $y_n$ the "goodness" from $m_2$'s perspective if $y_n$ has a particular value.

Together, they cover the entire graph!

$p(y_n = v) \propto \mu_{m_1 \to n}(y_n = v) \, \mu_{m_2 \to n}(y_n = v)$
From Messages to Variable Beliefs: General Formula

$p(y_n = v) \propto \prod_{m \in M(y_n)} \mu_{m \to n}(y_n = v)$
From Messages to Factor Beliefs: General Formula

(Figure: factor node $m$ with neighboring variable nodes $n_1$, $n_2$, $n_3$.)

$\nu_{n_i \to m}$ tells $m$ the "goodness" from $y_{n_i}$'s perspective if it has a particular value.

$p(y_{N(m)} = v) \propto f_m(y_{N(m)} = v) \prod_{n_i \in N(m)} \nu_{n_i \to m}(y_{n_i} = v_i)$
How to Use these Messages

Compute Marginals/Normalizer:
1. Select the root, or pick one if a tree
   a) Send messages from leaves to root
   b) Send messages from root to leaves
   c) Use messages to compute (unnormalized) marginal probabilities
2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat

For Learning/Inference: whenever you need to compute a likelihood, marginal probability, or a model-specific expectation, run this algorithm to compute the necessary probabilities.
- Prediction:
  - of a sequence: $p(z_1, \dots, z_N \mid x_{1:N})$
  - of an individual tag: $p(z_i \mid x_{1:N})$
- Marginal (if appropriate): $p(x_{1:N})$
- Learning model parameters: EM, variational inference, …
Example

(Figure: factor graph with variable nodes $y_1, y_2, y_3, y_4$ and factor nodes $A$, $B$, $C$: $A$ connects $y_1$ and $y_2$, $B$ connects $y_2$ and $y_3$, $C$ connects $y_2$ and $y_4$.)

Q: What are the variables? A: $y_1, y_2, y_3, y_4$
Q: What are the factors? A: $f_A(y_1, y_2)$, $f_B(y_2, y_3)$, $f_C(y_2, y_4)$
Example

Q: What is the distribution we're modeling?
A: $p(y_1, y_2, y_3, y_4) = \frac{1}{Z} f_A(y_1, y_2) \, f_B(y_2, y_3) \, f_C(y_2, y_4)$
Example

1. Select the root, or pick one if a tree ($y_3$)
   a) Send messages from leaves to root:

$\nu_{y_1 \to A}(y_1) = 1$
$\nu_{y_4 \to C}(y_4) = 1$
$\mu_{A \to y_2}(y_2) = \sum_a f_A(y_1 = a, y_2)$
$\mu_{C \to y_2}(y_2) = \sum_c f_C(y_2, y_4 = c)$
$\nu_{y_2 \to B}(y_2) = \mu_{A \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$
$\mu_{B \to y_3}(y_3) = \sum_b \nu_{y_2 \to B}(y_2 = b) \, f_B(y_2 = b, y_3)$

(General updates: $\nu_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(y_n)$; $\mu_{m \to n}(y_n) = \sum_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$)
Example

   b) Send messages from root to leaves:

$\nu_{y_3 \to B}(y_3) = 1$
$\mu_{B \to y_2}(y_2) = \sum_b f_B(y_2, y_3 = b)$
$\nu_{y_2 \to A}(y_2) = \mu_{B \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$

(We just computed $\mu_{B \to y_2}$. Q: Where did we compute $\mu_{C \to y_2}$? A: In step (a), leaves → root.)

$\nu_{y_2 \to C}(y_2) = \mu_{A \to y_2}(y_2) \, \mu_{B \to y_2}(y_2)$
$\mu_{C \to y_4}(y_4) = \sum_c \nu_{y_2 \to C}(y_2 = c) \, f_C(y_2 = c, y_4)$
$\mu_{A \to y_1}(y_1) = \sum_a \nu_{y_2 \to A}(y_2 = a) \, f_A(y_1, y_2 = a)$
Example

   c) Use messages to compute (unnormalized) marginal probabilities:

$p(y_n) \propto \prod_{m \in M(n)} \mu_{m \to n}(y_n)$

$p(y_1) \propto \mu_{A \to y_1}(y_1)$
$p(y_2) \propto \mu_{A \to y_2}(y_2) \, \mu_{B \to y_2}(y_2) \, \mu_{C \to y_2}(y_2)$
$p(y_3) \propto \mu_{B \to y_3}(y_3)$
$p(y_4) \propto \mu_{C \to y_4}(y_4)$
Example

2. Are we done?
   a) If a tree structure, we've converged
   b) If not:
      i. Either accept the partially converged result, or…
      ii. Go back to (1) and repeat [Loopy BP]
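To check the schedule end-to-end, here is a sketch that runs it on the four-variable tree with concrete, made-up potential tables for $f_A$, $f_B$, $f_C$, and compares the resulting marginal of $y_2$ against brute-force enumeration:

```python
import itertools

# Made-up pairwise potentials over binary variables.
pairs = list(itertools.product([0, 1], [0, 1]))
fA = dict(zip(pairs, [2.0, 1.0, 1.0, 3.0]))
fB = dict(zip(pairs, [1.0, 4.0, 2.0, 1.0]))
fC = dict(zip(pairs, [1.0, 1.0, 3.0, 1.0]))
vals = [0, 1]

# Leaves to root (root = y3); nu_{y1->A} = nu_{y4->C} = 1.
mu_A_y2 = {y2: sum(fA[(a, y2)] for a in vals) for y2 in vals}
mu_C_y2 = {y2: sum(fC[(y2, c)] for c in vals) for y2 in vals}
nu_y2_B = {y2: mu_A_y2[y2] * mu_C_y2[y2] for y2 in vals}
mu_B_y3 = {y3: sum(nu_y2_B[b] * fB[(b, y3)] for b in vals) for y3 in vals}

# Root to leaves; nu_{y3->B} = 1.
mu_B_y2 = {y2: sum(fB[(y2, b)] for b in vals) for y2 in vals}

# Marginal of y2: product of ALL incoming factor messages, then normalize.
belief_y2 = {v: mu_A_y2[v] * mu_B_y2[v] * mu_C_y2[v] for v in vals}
Zb = sum(belief_y2.values())
bp_marginal = {v: belief_y2[v] / Zb for v in vals}

# Brute force for comparison.
def joint(y1, y2, y3, y4):
    return fA[(y1, y2)] * fB[(y2, y3)] * fC[(y2, y4)]
Z = sum(joint(*y) for y in itertools.product(vals, repeat=4))
brute = {v: sum(joint(y1, v, y3, y4) for y1 in vals for y3 in vals
                for y4 in vals) / Z for v in vals}

print(bp_marginal, brute)  # the two marginals agree
```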
Max-Product (Max-Sum)

Problem: how to find the most likely (best) setting of the latent variables.

Replace sum (+) with max in the factor → variable computations:

$\mu_{m \to n}(y_n) = \max_{y_m \setminus n} f_m(y_m) \prod_{n' \in N(m) \setminus n} \nu_{n' \to m}(y_{n'})$

(Why "max-sum"? Computationally, implement with logs: the products of potentials become sums of log-potentials.)
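In log space the products indeed become sums, which is where the name max-sum comes from. A sketch of one factor-to-variable update on a two-variable factor (the tables and incoming message are illustrative), keeping an argmax "backpointer" so the best configuration can be read off:

```python
import math

# Potentials for one pairwise factor over binary y2, y3, moved to log space.
fB = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}
logfB = {k: math.log(v) for k, v in fB.items()}

# Incoming log-message nu_{y2->B} (e.g., from elsewhere in the tree).
log_nu_y2 = {0: math.log(3.0), 1: math.log(2.0)}

# Max-sum update: replace sum with max, and product with +.
log_mu_B_y3, back = {}, {}
for y3 in (0, 1):
    scores = {y2: log_nu_y2[y2] + logfB[(y2, y3)] for y2 in (0, 1)}
    back[y3] = max(scores, key=scores.get)   # best y2 for this y3
    log_mu_B_y3[y3] = scores[back[y3]]

best_y3 = max(log_mu_B_y3, key=log_mu_B_y3.get)
print(best_y3, back[best_y3])  # jointly best (y3, y2) under this subtree
```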
Loopy Belief Propagation

- The sum-product algorithm is not exact for general graphs.
- Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best.
- Requires a message passing schedule.
Outline
- Message Passing: Graphical Model Inference
- Example: Linear Chain CRF
Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
(Figure: a chain of tags $z_1, z_2, z_3, z_4$, each emitting a word $w_1, w_2, w_3, w_4$.)
- Generate each tag, and generate each word from the tag
- Locally normalized

Directed (e.g., maximum entropy Markov model [MEMM]; conditional):
(Figure: the same chain, with the words feeding into the tags.)
- Given each word, generate (predict) each tag
- Locally normalized

Undirected, drawn as a factor graph (e.g., conditional random field [CRF]):
(Figure: tags $z_1, \dots, z_4$ connected by tag-tag factors, with each tag also connected to a factor over all the words $w_1 w_2 w_3 w_4$.)
- Given all words, generate each tag
- Globally normalized

Q: What would the purple factors contain? A: Tag-to-tag potential scores.
Q: What would the green factors contain? A: Sequence & tag potential scores.
Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition:

POS tagging: President/Noun-Mod Obama/Noun told/Verb Congress/Noun …
NER: President/Person Obama/Person told/Other Congress/Org. …
Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$:

$p(z_1, z_2, \dots, z_N \mid x_{1:N})$
Linear Chain CRFs for Part of Speech Tagging

(Figure: factor graph with tag variables $z_1, \dots, z_4$, inter-tag factors $\psi_1, \dots, \psi_4$, and solo-tag factors $\phi_1, \dots, \phi_4$.)

Q: What's the general formula for a factor graph/undirected PGM distribution?
A: $p(z_1, z_2, \dots, z_N) = \frac{1}{Z} \prod_C \psi_C(z_C)$

Here, $p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto$ product of exponentiated potential scores.
Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \exp(-E_{\psi_1}(\psi_1)) \cdots \exp(-E_{\psi_N}(\psi_N)) \cdot \exp(-E_{\phi_1}(\phi_1)) \cdots \exp(-E_{\phi_N}(\phi_N))$

- We use the notation $E_{\psi_n}(\psi_n)$ to separate the features $\psi_n$ from how we reweight them
- We use $-E_{\psi_n}$ to represent these as Boltzmann distributions
Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{n=1}^{N} \exp(-E_{\psi_n}(\psi_n)) \exp(-E_{\phi_n}(\phi_n)) = \prod_{n=1}^{N} \exp(-[E_{\psi_n}(\psi_n) + E_{\phi_n}(\phi_n)])$

Let $E_{\psi_n}(\psi_n) = -\langle \theta_\psi, \psi_n \rangle$ and $E_{\phi_n}(\phi_n) = -\langle \theta_\phi, \phi_n \rangle$, where $\theta_\psi, \theta_\phi$ are learnable parameters. Then:

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{i=1}^{N} \exp(\langle \theta_\phi, \phi_i(z_i) \rangle + \langle \theta_\psi, \psi_i(z_i, z_{i+1}) \rangle)$
Linear Chain CRFs for Part of Speech Tagging

- $\psi_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
- $\phi_i$: solo tag features (can depend on any/all input words $x_{1:N}$)

Feature design, just like in maxent models! Example:

$\psi_{i, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = N$ and $z_{j+1} = V$, else $0$
$\psi_{i, \text{told}, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = N$ and $z_{j+1} = V$ and $x_j = \text{told}$, else $0$
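Putting the pieces together, here is a sketch of the unnormalized log-score a linear-chain CRF assigns to one tag sequence, with hand-written indicator features in the spirit of the examples above (the feature functions, weights, and tiny tag set are all illustrative):

```python
# Unnormalized linear-chain CRF log-score: sum of solo-tag (phi) and
# inter-tag (psi) feature scores; exponentiating gives the potential product.
words = ["President", "Obama", "told", "Congress"]
tags  = ["N", "N", "V", "N"]

def phi_features(i, z_i, x):
    # Solo tag features: may look at any/all words in x.
    return {f"word={x[i]},tag={z_i}": 1.0}

def psi_features(i, z_i, z_next, x):
    # Inter-tag features: may also look at any/all words in x.
    feats = {f"trans={z_i}->{z_next}": 1.0}
    if x[i] == "told" and z_i == "N" and z_next == "V":
        feats["told,N->V"] = 1.0   # the slide's lexicalized example feature
    return feats

theta = {  # learnable parameters; made-up values here
    "word=told,tag=V": 2.0, "trans=N->V": 0.5, "told,N->V": 1.5,
}

def log_score(z, x):
    s = 0.0
    for i in range(len(x)):
        for f, v in phi_features(i, z[i], x).items():
            s += theta.get(f, 0.0) * v
        if i + 1 < len(x):
            for f, v in psi_features(i, z[i], z[i + 1], x).items():
                s += theta.get(f, 0.0) * v
    return s  # log of the unnormalized probability; Z_x needs sum-product

print(log_score(tags, words))  # 2.5 with these weights
```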
(For discussion/whiteboard)
- How would we learn a CRF?
- What objective would we optimize?
- How would we use BP?
Key Insights (1)

- Minimizing the (structured) cross-entropy loss ⇔ (structured) maximum likelihood
- The gradient has the very familiar form of "observed feature counts − expected feature counts"
Key Insights (2)

- Rely on adjacency connections/independence assumptions to compute

$\mathbb{E}_{z'}\left[\sum_k h_k(z')\right] = \sum_k \sum_{z_{k-1}, z_k} p(z_{k-1}, z_k \mid x_{1:N}) \, h_k(z_{k-1}, z_k)$
Key Insights (3)

- Run BP to compute beliefs (unnormalized, joint marginals):

$p(z_{k-1}, z_k \mid x_{1:N}) \propto \psi_{k-1}(z_{k-1}, z_k) \cdot \nu_{z_{k-1} \to \psi_{k-1}}(z_{k-1}) \cdot \nu_{z_k \to \psi_{k-1}}(z_k)$

(Figure: the linear-chain CRF factor graph again, with tag variables, inter-tag factors $\psi_1, \dots, \psi_4$, and solo-tag factors $\phi_1, \dots, \phi_4$.)

Recall the factor-belief formula:
$p(y_{N(m)} = v) \propto f_m(y_{N(m)} = v) \prod_{n_i \in N(m)} \nu_{n_i \to m}(y_{n_i} = v_i)$
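A sketch of this insight on a tiny chain: compute the variable-to-factor messages around one transition factor and combine them into a normalized pairwise belief. It assumes a three-tag chain with two transition factors, and made-up emission scores already summarized as per-position unary messages; all names and tables are illustrative:

```python
import itertools

T = ["N", "V"]  # tiny tag set

# Made-up transition potentials psi(z_k, z_{k+1}) and per-position unary
# scores u_k(z_k) (the solo-tag factors phi_k, already summarized).
psi = {(a, b): 1.0 + (a == "N") + 2.0 * (a == "N" and b == "V")
       for a in T for b in T}
u = [{"N": 2.0, "V": 1.0}, {"N": 1.0, "V": 3.0}, {"N": 1.0, "V": 1.0}]

# Message from the second transition factor back into z2:
# mu_{psi2->z2}(z2) = sum over z3 of psi(z2, z3) * u3(z3).
mu_psi2_z2 = {b: sum(psi[(b, c)] * u[2][c] for c in T) for b in T}

# Variable-to-factor messages into the first transition factor psi1.
nu_z1 = {a: u[0][a] for a in T}
nu_z2 = {b: u[1][b] * mu_psi2_z2[b] for b in T}

# Pairwise belief: b(z1, z2) ∝ psi(z1, z2) * nu_z1(z1) * nu_z2(z2).
belief = {(a, b): psi[(a, b)] * nu_z1[a] * nu_z2[b]
          for a, b in itertools.product(T, T)}
Z = sum(belief.values())
print({k: v / Z for k, v in belief.items()})  # p(z1, z2 | x), exact on a chain
```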