Junction-tree algorithm, Probabilistic Graphical Models, Sharif University of Technology (PowerPoint PPT presentation)



SLIDE 1: Junction-tree algorithm

Probabilistic Graphical Models, Sharif University of Technology, Spring 2016, Soleymani

SLIDE 2: Junction-tree algorithm: a general approach

• Junction trees, as opposed to sum-product on trees, can be applied to general graphs.

• The junction tree algorithm, as opposed to the elimination algorithm, is not "query-oriented":

  • it enables us to record and reuse the intermediate factors to respond to multiple queries simultaneously;

  • upon convergence of the algorithm, we obtain marginal probabilities for all cliques of the original graph.

SLIDE 3: Cluster tree

• A cluster tree is a singly connected graph (i.e., with exactly one path between each pair of nodes) whose nodes are the cliques of an underlying graph.

• A separator set is defined for each linked pair of cliques; it contains the variables in the intersection of the two cliques.

[Figure: two cliques {Y_B, Y_C} and {Y_C, Y_D} joined through the separator set {Y_C}.]

SLIDE 4: Example: variable elimination and cluster tree

• Elimination order: Y6, Y5, Y4, Y3, Y2

[Figure: the original graph over Y1, …, Y6 and its moralized graph.]

SLIDE 5: Example: elimination cliques

• Elimination order: Y6, Y5, Y4, Y3, Y2

[Figure: the graph over Y1, …, Y6 at each elimination step, with the elimination clique formed at each step.]

SLIDE 6: Example: cluster tree obtained by VE

[Figure: the graph over Y1, …, Y6 and the cluster tree obtained by VE.]

• Elimination order: Y6, Y5, Y4, Y3, Y2

• The cluster tree contains the cliques (fully connected subsets) generated as elimination executes.

• The cluster graph induced by an execution of VE is necessarily a tree:

  • indeed, after an elimination, the corresponding elimination clique cannot reappear.

SLIDE 7: Example: cluster tree obtained by VE

• The same cluster tree as on the previous slide, with the maximal cliques highlighted.

SLIDE 8: Cluster tree usefulness

• The cluster tree provides a structure for caching computations:

  • multiple queries can be performed much more efficiently than by running VE for each one separately.

• The cluster tree dictates a partial order over the operations performed on factors, which yields a better computational complexity.

SLIDE 9: Junction tree property

• Junction tree property: if a variable appears in two cliques of the clique tree, it must appear in all cliques on the path connecting them.

• For every pair of cliques D_j and D_k, all cliques on the path between D_j and D_k contain T_jk = D_j ∩ D_k.

• This is also called the running intersection property. A cluster tree that satisfies the running intersection property is called a clique tree or junction tree.
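The running intersection property can be checked mechanically. The sketch below is illustrative (my own code, not from the slides): `cliques` maps node ids to variable sets, `edges` lists the tree edges, and the check walks the unique path between every pair of cliques.

```python
from itertools import combinations

def satisfies_running_intersection(cliques, edges):
    """Junction tree property: for every pair of cliques, their intersection
    must be contained in every clique on the unique path between them."""
    # adjacency list of the cluster tree
    adj = {i: [] for i in cliques}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def path(src, dst):
        # DFS for the unique path in a tree
        stack = [(src, [src])]
        while stack:
            node, p = stack.pop()
            if node == dst:
                return p
            for nb in adj[node]:
                if nb not in p:
                    stack.append((nb, p + [nb]))
        return None

    for a, b in combinations(cliques, 2):
        common = cliques[a] & cliques[b]
        for node in path(a, b):
            if not common <= cliques[node]:
                return False
    return True
```

Run on the clique tree from the running example, the check passes; a chain of cliques {A,B}, {B,C}, {A,C} fails it because A is missing from the middle clique.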

SLIDE 10: Theorem

• Theorem: the tree induced by a variable elimination algorithm satisfies the running intersection property.

• Proof sketch:

  • Let D and D′ be two clusters that contain Y, and let D_Y be the cluster where Y is eliminated. We prove that Y must be present in every clique on the path between D and D_Y (and similarly on the path between D_Y and D′).

  • Idea: the computation at D_Y must happen later than the computation at D or D′; until Y is eliminated at D_Y, every factor produced along the path still contains Y.

SLIDE 11: Separation set

• Theorem 1: in a clique tree induced by a variable elimination algorithm, let n_jk be the message that D_j sends to the neighboring cluster D_k; the scope of this message is T_jk = D_j ∩ D_k.

• Theorem 2: a cluster tree satisfies the running intersection property if and only if, for every separation set T_jk, W_≺(j,k) and W_≺(k,j) are separated in I given T_jk.

  • W_≺(j,k): the set of all variables in the scopes of all cliques on the D_j side of the edge (j, k).

SLIDE 12: Junction tree algorithm

• Given a factorized probability distribution Q with Markov network I, the algorithm builds a junction tree U based on I.

• For each clique, it finds the marginal probability over the variables in that clique. Two formulations:

  • Message-passing sum-product (Shafer-Shenoy algorithm): run a message-passing algorithm on the junction tree constructed according to the distribution.

  • Belief update, i.e., local consistency preservation (Hugin algorithm): apply rescaling (update) equations.

SLIDE 13: Junction tree algorithm: inference

• The junction tree inference algorithm is message passing on a junction tree structure.

• Each clique starts with a set of initial factors:

  • each factor of the distribution Q is assigned to one and only one clique of U whose variable set contains the scope of that factor.

• Each clique sends one message to each neighbor, following a schedule:

  • each clique multiplies the incoming messages by its potential, sums out one or more variables, and sends the resulting outgoing message.

• After message passing, each clique can compute the marginal over its variables by combining its potential with the messages received from its neighbors.
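The factor assignment step can be sketched as follows (an illustrative snippet; the function name and the "first matching clique" rule are my own choices, not from the slides):

```python
def assign_factors(factor_scopes, cliques):
    """Assign each factor to exactly one clique whose variable set contains
    the factor's scope (first matching clique wins)."""
    assignment = {}
    for name, scope in factor_scopes.items():
        for idx, clique in enumerate(cliques):
            if scope <= clique:  # the clique covers the factor's scope
                assignment[name] = idx
                break
        else:
            raise ValueError("no clique covers factor " + name)
    return assignment
```

Any covering clique works; assigning each factor exactly once is what matters, since clique potentials are the products of their assigned factors.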

SLIDE 14: Junction-tree message passing: Shafer-Shenoy algorithm

n_jk(T_jk) = Σ_{D_j \ T_jk} ω_j ∏_{l ∈ N(j) \ {k}} n_lj(T_lj)

Q(D_s) ∝ ω_s ∏_{l ∈ N(s)} n_ls(T_ls)

Q(T_jk) ∝ n_jk(T_jk) n_kj(T_jk)

• The marginal on a clique is the product of its initial potential and the messages from its neighbors.

• ω_j = ∏_{ϱ ∈ G_j} ϱ, where G_j denotes the set of factors assigned to clique D_j.

[Figure: neighboring cliques D_j and D_k exchange the messages n_jk(T_jk) and n_kj(T_jk) across the separator T_jk.]
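A minimal sketch of the Shafer-Shenoy recursion, under assumptions of my own (not from the course): potentials are dictionaries from assignment tuples to values, and all variables are binary to keep the product step short.

```python
from itertools import product

def marginalize(table, scope, keep):
    """Sum out every variable of `scope` not in `keep`."""
    idx = [scope.index(v) for v in keep]
    out = {}
    for assign, val in table.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out

def multiply(t1, s1, t2, s2):
    """Pointwise product; result scope is s1 followed by the rest of s2.
    Variables are assumed binary for brevity."""
    scope = list(s1) + [v for v in s2 if v not in s1]
    out = {}
    for assign in product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, assign))
        out[assign] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return out, scope

def message(j, k, cliques, omega, neighbors, cache=None):
    """n_jk(T_jk) = sum over D_j \\ T_jk of omega_j times the product of
    incoming messages n_lj from all neighbors l of j other than k."""
    if cache is None:
        cache = {}
    if (j, k) not in cache:
        table, scope = omega[j], list(cliques[j])
        for l in neighbors[j]:
            if l != k:
                m, ms = message(l, j, cliques, omega, neighbors, cache)
                table, scope = multiply(table, scope, m, ms)
        sep = [v for v in cliques[j] if v in cliques[k]]  # T_jk
        cache[(j, k)] = (marginalize(table, scope, sep), sep)
    return cache[(j, k)]
```

For two cliques ("A","B") and ("B","C"), the product n_01 · n_10 gives the unnormalized separator marginal Q(T_01), matching Q(T_jk) ∝ n_jk n_kj above.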

SLIDE 15: Junction-tree algorithm: correctness

• If Y is eliminated when a message is sent from D_j to a neighboring clique D_k such that Y ∈ D_j and Y ∉ D_k, then Y does not appear in the tree on the D_k side of the edge (j, k) after elimination.

• Notation:

  • W_≺(j,k): the set of all variables in the scopes of all cliques on the D_j side of the edge (j, k);

  • G_≺(j,k): the set of factors in the cliques on the D_j side of the edge (j, k);

  • G_j: the set of factors in the clique D_j.

SLIDE 16: Junction-tree algorithm: correctness
• Induction on the length of the path from the leaves:

• Base step: leaves.

• Inductive step (D_j has neighbors D_j1, …, D_jn besides D_k):

n_{j→k}(T_jk)
  = Σ_{W_≺(j,k)} ∏_{ϱ ∈ G_≺(j,k)} ϱ
  = Σ_{D_j \ T_jk} Σ_{W_≺(j1,j)} … Σ_{W_≺(jn,j)} ( ∏_{ϱ ∈ G_≺(j1,j)} ϱ ) … ( ∏_{ϱ ∈ G_≺(jn,j)} ϱ ) ( ∏_{ϱ ∈ G_j} ϱ )
  = Σ_{D_j \ T_jk} ( ∏_{ϱ ∈ G_j} ϱ ) ( Σ_{W_≺(j1,j)} ∏_{ϱ ∈ G_≺(j1,j)} ϱ ) … ( Σ_{W_≺(jn,j)} ∏_{ϱ ∈ G_≺(jn,j)} ϱ )
  = Σ_{D_j \ T_jk} ω_j × n_{j1→j} × ⋯ × n_{jn→j}

• W_≺(j,k) is a disjoint union of the sets W_≺(jl,j) for l = 1, …, n, which justifies splitting the sum.

[Figure: clique D_j with neighbors D_j1, …, D_jn, sending the message n_{j→k} to D_k.]

SLIDE 17: Message passing schedule

• A two-pass message-passing schedule: arbitrarily pick a node as the root.

• First pass: starts at the leaves and proceeds inward:

  • each node passes a message to its parent;

  • continues until the root has obtained messages from all of its adjoining nodes.

• Second pass: starts at the root and passes the messages back out:

  • messages are passed in the reverse direction;

  • continues until all leaves have received their messages.
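The two-pass schedule can be computed up front. This is an illustrative sketch (function name my own): a DFS from the root fixes parents, the inward pass sends child-to-parent messages with children first, and the outward pass reverses every edge.

```python
def two_pass_schedule(neighbors, root):
    """Return the message order (src, dst) for the two-pass schedule on a
    tree: first pass leaves -> root, second pass root -> leaves."""
    # DFS from the root to record parents and a root-first node order
    order, parent, stack = [], {root: None}, [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for nb in neighbors[node]:
            if nb != parent[node]:
                parent[nb] = node
                stack.append(nb)
    # first pass: each node sends to its parent, children before parents
    inward = [(n, parent[n]) for n in reversed(order) if parent[n] is not None]
    # second pass: same edges in the reverse direction, parents first
    outward = [(parent[n], n) for n in order if parent[n] is not None]
    return inward + outward
```

The resulting order obeys the message-passing protocol of the later slides: a clique sends to a neighbor only after it has heard from all of its other neighbors.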

SLIDE 18: Junction tree algorithm, belief update perspective: Hugin algorithm

• We define two sets of potential functions:

  • Clique potentials: on each clique Y_D, a potential function ω(Y_D) proportional to the marginal probability on that clique: ω(Y_D) ∝ Q(Y_D).

  • Separator potentials: on each separator set Y_T, a potential function ϱ(Y_T) proportional to the marginal probability on Y_T: ϱ(Y_T) ∝ Q(Y_T).

• This enables us to obtain a local representation of marginal probabilities in cliques.

[Figure: cliques W and X joined through separator T, with potentials ω_W, ω_X, and ϱ_T.]

SLIDE 19: Extended representation of joint probability

• We intend to find an extended representation:

Q(Y) ∝ ∏_D ω_D(Y_D) / ∏_T ϱ_T(Y_T)

• where the global representation ∏_D ω_D(Y_D) / ∏_T ϱ_T(Y_T) corresponds to the joint probability,

• and the local representations ω_D(Y_D) and ϱ_T(Y_T) correspond to marginal probabilities.

SLIDE 20: Consistency

• Consistency: since the potentials are required to represent marginal probabilities, they must give the same marginals for the variables that they have in common.

• Consistency is a necessary and sufficient condition for the inference algorithm to find potentials that are marginals.

• We first introduce local consistency:

ϱ_{T_jk} = Σ_{D_j \ T_jk} ω_j = Σ_{D_k \ T_jk} ω_k

SLIDE 21: Local consistency (updates)

• Updating X based on W (passing information from W to X):

ϱ_T* = Σ_{W \ T} ω_W         (marginalization)
ω_X* = (ϱ_T* / ϱ_T) ω_X      (rescaling)
ω_W* = ω_W

• Updating W based on X (passing information from X to W):

ϱ_T** = Σ_{X \ T} ω_X*        (marginalization)
ω_W** = (ϱ_T** / ϱ_T*) ω_W*   (rescaling)
ω_X** = ω_X*

• The separator potentials have been initialized to unity.

[Figure: cliques W and X with separator T and potentials ω_W, ϱ_T, ω_X, before and after the updates.]
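A minimal sketch of one Hugin update, assuming (my convention, not the slides') that potentials are tables keyed by assignment tuples in scope order:

```python
def hugin_update(omega_src, scope_src, omega_dst, scope_dst, rho):
    """Pass information from the source clique to the destination clique
    through their separator T = scope_src ∩ scope_dst."""
    sep = [v for v in scope_src if v in scope_dst]
    i_src = [scope_src.index(v) for v in sep]
    i_dst = [scope_dst.index(v) for v in sep]
    # rho* = sum over (source clique \ T) of omega_src   (marginalization)
    rho_new = {}
    for assign, val in omega_src.items():
        key = tuple(assign[i] for i in i_src)
        rho_new[key] = rho_new.get(key, 0.0) + val
    # omega_dst* = (rho* / rho) * omega_dst              (rescaling)
    omega_dst_new = {}
    for assign, val in omega_dst.items():
        key = tuple(assign[i] for i in i_dst)
        omega_dst_new[assign] = val * rho_new[key] / rho[key]
    return rho_new, omega_dst_new
```

Applying the update once in each direction (W to X, then X to W) makes the two clique potentials locally consistent with respect to the separator, with the source potential left unchanged by each call, exactly as the equations above prescribe.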

SLIDE 22: Properties of updates

• Invariant joint: after these updates, the joint probability remains unchanged:

ω_X** ω_W** / ϱ_T** = ω_X* ω_W* / ϱ_T* = ω_X ω_W / ϱ_T

• Local consistency: ω_W** and ω_X** are consistent with respect to T:

Q(Y_T) = ϱ_T** = Σ_{X \ T} ω_X** = Σ_{W \ T} ω_W**

  • Σ_{X \ T} ω_X** = Σ_{X \ T} ω_X* = ϱ_T**, since ω_X** = ω_X*;

  • Σ_{W \ T} ω_W** = Σ_{W \ T} (ϱ_T** / ϱ_T*) ω_W* = (ϱ_T** / ϱ_T*) Σ_{W \ T} ω_W* = (ϱ_T** / ϱ_T*) ϱ_T* = ϱ_T**.

SLIDE 23: Global consistency

• To ensure global consistency by enforcing local consistency, the variables that appear in two different cliques must also appear in all of the cliques on the path between them.

• Junction-tree property: local consistency ⇒ global consistency.

• Thus, it suffices to arrange the cliques into a junction tree and only enforce local consistency between neighboring cliques:

  • instead of enforcing consistency on all pairs of overlapping cliques;

  • information flows along the path to ensure consistency of the overlapping cliques.

SLIDE 24: Message-passing in a clique-tree

• We indeed perform local updates (to maintain local consistency) when we have multiple overlapping cliques.

• We need a protocol that constrains the order in which updates are performed:

  • updates are arranged so that local consistency between a clique and one neighbor is not ruined by subsequent updates between the clique and its other neighbors.

• Message-passing protocol: a clique can send a message to a neighboring clique only when it has received messages from all of its other neighbors.
SLIDE 25: Hugin algorithm: summary

• Compilation:

  • directed model → undirected moralized graph;

  • graph triangulation;

  • creation of a junction tree using a maximum spanning tree approach.

• Initialization and evidence:

  • each potential of the original graph (possibly sliced by the evidence) is multiplied onto exactly one clique of the junction tree; separators are initialized to unity.

• Propagation of probabilities:

  • apply the update equations according to a schedule;

  • when the algorithm terminates, the clique potentials and separator potentials are proportional to marginal probabilities.

• Normalize the clique potentials to obtain conditional probabilities on the corresponding cliques.

SLIDE 26: Correctness of junction-tree algorithm

• Theorem: let Q(y, f) be represented by the clique potentials ω_D and separator potentials ϱ_T of a junction tree, i.e., Q(y, f) ∝ ∏_D ω_D / ∏_T ϱ_T. When the junction tree algorithm terminates, the clique potentials and the separator potentials are proportional to local marginals.

SLIDE 27: Clique trees from variable elimination

• Each clique in the clique tree induced by VE is also a clique in the induced graph, and vice versa.

• However, for inference we can reduce the clique tree to contain only the maximal cliques of the induced graph.


SLIDE 29: Triangulated graphs

• What class of graphs has a junction tree?

• A triangulated (or chordal) graph contains no cycle of four or more nodes without a chord.

• Triangulation is the necessary and sufficient condition for a graph to have a junction tree:

  • only triangulated graphs have the property that their clique trees are junction trees.

SLIDE 30: Elimination algorithm: triangulation

• Every induced graph is chordal.

• For any chordal graph, there is an elimination ordering that does not add any fill edges.

• In general, finding the best triangulation is NP-hard, but good heuristics exist.

[Figure: the moralized graph over Y1, …, Y6 and the induced (triangulated) graph produced by elimination.]
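Fill edges are easy to count by simulating elimination. The sketch below is illustrative (my own code): eliminating a node connects all of its remaining neighbors, and an ordering that adds no fill edges certifies that the graph is chordal.

```python
def fill_edges(adjacency, order):
    """Simulate variable elimination on an undirected graph and return
    the fill edges added along the given elimination order."""
    adj = {v: set(nbs) for v, nbs in adjacency.items()}
    fill = []
    for v in order:
        nbs = sorted(adj[v])
        # connect all remaining neighbors of v pairwise
        for i, a in enumerate(nbs):
            for b in nbs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.append((a, b))
        # remove v from the graph
        for a in nbs:
            adj[a].discard(v)
        del adj[v]
    return fill
```

On a 4-cycle, eliminating any node first adds the chord between its two neighbors; on a triangle (already chordal), every order adds nothing.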

SLIDE 31: Junction-tree construction

• Construct the undirected graph.

• Triangulate the graph:

  • e.g., find the induced graph resulting from VE with a specified elimination order of the nodes.

• Find the set of maximal elimination cliques of the triangulated graph.

• Build a weighted, complete graph H over these maximal cliques:

  • weight each edge between cliques D_j and D_k by |D_j ∩ D_k|.

• Find a maximum spanning tree of H as a junction tree:

  • a clique tree is a junction tree iff it is a maximum spanning tree.
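The construction steps above can be sketched with Kruskal's algorithm (an illustrative implementation; the function name and union-find details are my own):

```python
from itertools import combinations

def build_junction_tree(cliques):
    """Weight every pair of maximal cliques by |D_j ∩ D_k| and take a
    maximum spanning tree via Kruskal's algorithm with a union-find."""
    edges = sorted(((len(cliques[i] & cliques[j]), i, j)
                    for i, j in combinations(range(len(cliques)), 2)),
                   reverse=True)  # heaviest separators first
    parent = list(range(len(cliques)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in edges:
        if w == 0:
            break  # remaining edges have empty separators
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

On the maximal cliques of the running example, the tree keeps both weight-2 edges (separators {Y2, Y3} and {Y2, Y5}) and attaches {Y2, Y4} through a weight-1 edge.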

SLIDE 32: Junction tree construction: example

[Figure: the construction steps on the graph over Y1, …, Y6, followed by the weighted graph over the maximal cliques {Y1, Y2, Y3}, {Y2, Y3, Y5}, {Y2, Y5, Y6}, {Y2, Y4}, with candidate separator sets {Y2, Y3}, {Y2, Y5}, and several copies of {Y2}, and the resulting maximum spanning (junction) tree.]

SLIDE 33: Building a junction tree

• Different junction trees are obtained for different triangulations:

  • obtained from different elimination orders (and different maximum spanning trees).

• Complexity of junction tree algorithms:

  • the time and space complexity is dominated by the size of the largest clique in the junction tree (exponential in the size of the largest clique).

• Finding the junction tree with the smallest cliques is an NP-hard problem:

  • finding the optimum ordering for the elimination algorithm is NP-hard;

  • but for many graphs an optimum or near-optimum ordering can be found heuristically.

SLIDE 34: Junction tree algorithm: summary

• A generic exact inference algorithm for any graphical model.

• Yields marginal probabilities of all cliques via a message-passing algorithm on a junction tree constructed from the original graph.

• Can answer multiple queries in a single run.

• The time and space complexity of the algorithm is exponential in the size of the largest clique.

SLIDE 35: References

• D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", Chapter 10.

• M. I. Jordan, "An Introduction to Probabilistic Graphical Models", Chapter 17.