Learning chordal Markov networks by dynamic programming
Kustaa Kangas, Teppo Niinimäki, Mikko Koivisto
NIPS 2014 (to appear)
November 27, 2014
Probabilistic graphical models
Graphical model:
◮ Graph structure G on the vertex set V = {1, ..., n}
◮ Represents conditional independencies in a joint distribution p(X) = p(X1, ..., Xn)
Advantages:
◮ Easy to read
◮ Compact way to store a distribution
◮ Efficient inference
Probabilistic graphical models
◮ Directed models: Bayesian networks, ...
◮ Undirected models: Markov networks, ...
Structure learning problem: given samples from p(X1, ..., Xn), find a model that best fits the sampled data.
Probabilistic graphical models
Structure learning in chordal Markov networks: find a chordal Markov network that maximizes a given decomposable score.
Prior work:
◮ Constraint satisfaction (Corander et al.)
◮ Integer linear programming (Bartlett and Cussens)
Our result: dynamic programming in O(4^n) time and O(3^n) space for n variables.
◮ First non-trivial bound
◮ Competitive in practice
Markov networks
◮ Joint distribution p(X) = p(X1, ..., Xn)
◮ Undirected graph G on V = {1, ..., n} with the global Markov property: for A, B, S ⊆ V, it holds that X_A ⊥⊥ X_B | X_S whenever S separates A and B in G.
Markov networks
If p is strictly positive, it factorizes as

    p(X1, ..., Xn) = ∏_{C ∈ 𝒞} ψ_C(X_C),

where
◮ 𝒞 is the set of (maximal) cliques of G
◮ the ψ_C are mappings to the positive reals
◮ X_C = {X_v : v ∈ C}
(Hammersley–Clifford theorem)
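As a concrete illustration, the factorization can be checked numerically on a tiny example: the chain 1 - 2 - 3 over binary variables, with cliques {1,2} and {2,3} and hypothetical positive potentials (any positive values would do). Since vertex 2 separates 1 and 3, the global Markov property demands X1 ⊥⊥ X3 | X2, and the distribution defined by the clique product indeed satisfies it:

```python
import itertools

# hypothetical positive clique potentials for the chain 1 - 2 - 3
psi12 = [[1.0, 2.0], [3.0, 0.5]]
psi23 = [[2.0, 1.0], [0.5, 4.0]]

# normalizing constant of the clique-product distribution
Z = sum(psi12[a][b] * psi23[b][c]
        for a, b, c in itertools.product((0, 1), repeat=3))

def p(a, b, c):
    # joint distribution defined by the factorization p ∝ ψ12 ψ23
    return psi12[a][b] * psi23[b][c] / Z

# marginals needed to test the conditional independence X1 ⊥⊥ X3 | X2
def p2(b):     return sum(p(a, b, c) for a in (0, 1) for c in (0, 1))
def p12(a, b): return sum(p(a, b, c) for c in (0, 1))
def p23(b, c): return sum(p(a, b, c) for a in (0, 1))

for a, b, c in itertools.product((0, 1), repeat=3):
    lhs = p(a, b, c) / p2(b)                          # p(x1, x3 | x2)
    rhs = (p12(a, b) / p2(b)) * (p23(b, c) / p2(b))   # p(x1 | x2) p(x3 | x2)
    assert abs(lhs - rhs) < 1e-12
```

The loop passes for any choice of positive potentials, which is the content of the global Markov property for this graph.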
Bayesian networks
◮ Directed acyclic graph
◮ Conditional independencies given by d-separation
◮ Factorizes as

    p(X1, ..., Xn) = ∏_{i=1}^{n} p(X_i | parents(X_i))
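A minimal numeric sketch of this factorization, for a hypothetical chain-structured network X1 → X2 → X3 over binary variables (the CPT values are made up); multiplying the conditional tables yields a proper joint distribution:

```python
import itertools

# hypothetical conditional probability tables for X1 → X2 → X3
p1 = [0.6, 0.4]                           # p(X1)
p2_given_1 = [[0.7, 0.3], [0.2, 0.8]]     # p(X2 | X1)
p3_given_2 = [[0.9, 0.1], [0.5, 0.5]]     # p(X3 | X2)

def p(a, b, c):
    # p(X1, X2, X3) = p(X1) p(X2 | X1) p(X3 | X2)
    return p1[a] * p2_given_1[a][b] * p3_given_2[b][c]

# the product of CPTs sums to 1 over all assignments
total = sum(p(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12
```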
Bayesian and Markov networks
◮ Bayesian and Markov networks are not equivalent
◮ Chordal Markov networks are exactly the intersection of the two
Chordal graphs
◮ A chord is an edge between two non-consecutive vertices of a cycle.
◮ A graph is chordal (or triangulated) if every cycle of at least 4 vertices has a chord.
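Chordality can be tested in linear time via maximum cardinality search (Tarjan and Yannakakis): a graph is chordal iff the MCS order is a perfect elimination ordering when reversed. A small sketch, with the graph given as a dict of neighbour sets (the verification pass below is written for clarity, not for the optimal linear running time):

```python
def is_chordal(adj):
    """Maximum cardinality search: repeatedly pick the unvisited vertex with
    the most visited neighbours, then verify that the resulting order is a
    perfect elimination ordering. The graph is chordal iff the check passes."""
    weight = {v: 0 for v in adj}
    seen = set()
    order = []
    for _ in range(len(adj)):
        v = max((u for u in adj if u not in seen), key=lambda u: weight[u])
        order.append(v)
        seen.add(v)
        for u in adj[v]:
            if u not in seen:
                weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    # elimination check: for each v, its earlier neighbours other than the
    # latest one must all be adjacent to that latest earlier neighbour
    for v in order:
        pred = [u for u in adj[v] if pos[u] < pos[v]]
        if not pred:
            continue
        w = max(pred, key=lambda u: pos[u])
        if not all(u in adj[w] for u in pred if u != w):
            return False
    return True

# the 4-cycle 1-2-3-4-1 has no chord; adding the chord 1-3 makes it chordal
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
chorded = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```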
Clique tree decomposition
[Figure: an example chordal graph on vertices 1–9]
Clique tree decomposition
[Figure: a clique tree of the example graph, with cliques as nodes]
Running intersection property: for all C1, C2 ∈ 𝒞, every clique on the path between C1 and C2 contains C1 ∩ C2.
Clique tree decomposition
[Figure: the clique tree with its separators highlighted]
Separator: the intersection of two adjacent cliques in a clique tree. Every clique tree of a graph has the same multiset of separators.
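The running intersection property is straightforward to verify directly for a candidate clique tree. A sketch, with cliques as a list of sets and tree edges on their indices (both example trees below are hypothetical small cases, not the figure's graph):

```python
from collections import deque

def has_running_intersection(cliques, tree_edges):
    """Check the running intersection property: for every pair of cliques,
    each clique on the tree path between them contains their intersection."""
    adj = {i: [] for i in range(len(cliques))}
    for a, b in tree_edges:
        adj[a].append(b)
        adj[b].append(a)

    def path(src, dst):
        # BFS parent pointers, then walk back from dst to src
        parent = {src: None}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in parent:
                    parent[u] = v
                    queue.append(u)
        nodes, node = [], dst
        while node is not None:
            nodes.append(node)
            node = parent[node]
        return nodes

    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            need = cliques[i] & cliques[j]
            if not all(need <= cliques[k] for k in path(i, j)):
                return False
    return True

# a valid chain of cliques, and the same cliques attached in a bad order
chain_cliques = [{1, 2}, {2, 3}, {3, 4}]
bad_order = [{1, 2}, {3, 4}, {2, 3}]
```

In `bad_order` the path from {1,2} to {2,3} passes through {3,4}, which does not contain their intersection {2}, so the property fails.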
Clique tree decomposition
[Figure: the example chordal graph alongside one of its clique trees]
Theorem: a graph is chordal if and only if it has a clique tree.
Chordal Markov networks
[Figure: the example chordal graph on vertices 1–9]
◮ Choose ψ_{C_i}(X_{C_i}) = p(X_{C_i}) / p(X_{S_i})
◮ The factorization becomes

    p(X1, ..., Xn) = ∏_{C ∈ 𝒞} ψ_C(X_C) = ∏_{C ∈ 𝒞} p(X_C) / ∏_{S ∈ 𝒮} p(X_S),

where 𝒞 and 𝒮 are the sets of cliques and separators.
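The clique/separator form can be checked numerically on the 3-variable chain 1 - 2 - 3 (hypothetical potentials): its cliques are {1,2} and {2,3}, its only separator is {2}, and the joint equals p(X1, X2) p(X2, X3) / p(X2):

```python
import itertools

# chain 1 - 2 - 3: cliques {1,2} and {2,3}, separator {2}
psi12 = [[1.0, 2.0], [3.0, 0.5]]   # hypothetical positive potentials
psi23 = [[2.0, 1.0], [0.5, 4.0]]
Z = sum(psi12[a][b] * psi23[b][c]
        for a, b, c in itertools.product((0, 1), repeat=3))

def p(a, b, c):
    return psi12[a][b] * psi23[b][c] / Z

def p12(a, b): return sum(p(a, b, c) for c in (0, 1))   # clique marginal
def p23(b, c): return sum(p(a, b, c) for a in (0, 1))   # clique marginal
def p2(b):     return sum(p12(a, b) for a in (0, 1))    # separator marginal

# p(X) = ∏_{C} p(X_C) / ∏_{S} p(X_S) holds pointwise
for a, b, c in itertools.product((0, 1), repeat=3):
    assert abs(p(a, b, c) - p12(a, b) * p23(b, c) / p2(b)) < 1e-12
```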
Structure learning
Given data D sampled from p(X1, ..., Xn), how well does a graph structure G fit the data?
Common scoring criteria decompose as

    score(G) = ∏_{C ∈ 𝒞} score(C) / ∏_{S ∈ 𝒮} score(S)

Each score(C) is the probability of the data projected onto C, possibly extended with a prior or penalization term (e.g. maximum likelihood, Bayesian Dirichlet, ...).
Structure learning
Structure learning problem in chordal Markov networks: given score(C) for each C ⊆ V, find a chordal graph G that maximizes

    score(G) = ∏_{C ∈ 𝒞} score(C) / ∏_{S ∈ 𝒮} score(S).

We assume each score(C) can be computed efficiently and focus on the combinatorial problem.
Structure learning
Brute-force solution:
◮ Enumerate all undirected graphs
◮ Determine which are chordal
◮ For each chordal G, find a clique tree to evaluate score(G)
◮ Time O*(2^(n choose 2))
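For very small n the enumeration step is easy to sketch. Below, all 2^6 = 64 graphs on 4 labeled vertices are generated and tested for chordality by brute-force search for an induced chordless cycle (exponential, but fine at this size); exactly the three labeled 4-cycles are non-chordal, leaving 61 chordal graphs:

```python
import itertools

def is_chordal(n, edges):
    """Brute-force chordality test for tiny graphs: chordal iff there is
    no induced (chordless) cycle on 4 or more vertices."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    for k in range(4, n + 1):
        for sub in itertools.combinations(range(n), k):
            s = set(sub)
            if any(len(adj[v] & s) != 2 for v in sub):
                continue
            # every induced degree is 2, so the induced subgraph is a
            # disjoint union of cycles; connected means one chordless cycle
            seen, stack = {sub[0]}, [sub[0]]
            while stack:
                v = stack.pop()
                for u in adj[v] & s:
                    if u not in seen:
                        seen.add(u)
                        stack.append(u)
            if len(seen) == k:
                return False
    return True

pairs = list(itertools.combinations(range(4), 2))   # the 6 possible edges
chordal_count = sum(
    is_chordal(4, [e for e, keep in zip(pairs, mask) if keep])
    for mask in itertools.product((0, 1), repeat=len(pairs)))
```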
Structure learning
We write score(T) = score(G) when T is a clique tree of G.
◮ Every clique tree T uniquely specifies a chordal graph G.
◮ Thus we can search the space of clique trees instead.
Recursive characterization
[Figure: the clique tree of the example graph, rooted at a clique C]
Let T be rooted at C, with subtrees T1, ..., Tk rooted at C1, ..., Ck. Then

    score(T) = score(C) ∏_{i=1}^{k} score(T_i) / score(C ∩ C_i)
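This characterization transcribes directly into a recursive score computation. A minimal sketch (the child-map representation of a rooted clique tree and the two-clique chain example are hypothetical):

```python
def tree_score(tree, root, score):
    """score(T) for a rooted clique tree, following the recursive
    characterization: score(T) = score(C) * prod_i score(T_i) / score(C ∩ C_i).
    `tree` maps each clique (a frozenset) to the list of its child cliques."""
    total = score(root)
    for child in tree.get(root, []):
        total *= tree_score(tree, child, score) / score(root & child)
    return total

# hypothetical scores for the chain with cliques {1,2}, {2,3}, separator {2}
scores = {frozenset({1, 2}): 6.0, frozenset({2, 3}): 6.0, frozenset({2}): 1.0}
chain_tree = {frozenset({1, 2}): [frozenset({2, 3})]}
```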
Recurrence
For S ⊂ V and ∅ ⊂ R ⊆ V \ S, let f(S, R) be the maximum of score(G) over chordal graphs G on S ∪ R such that S is a proper subset of a clique. The solution is then given by f(∅, V), which satisfies the recurrence

    f(S, R) = max score(C) ∏_{i=1}^{k} f(S_i, R_i) / score(S_i),

where the maximum is over cliques C with S ⊂ C ⊆ S ∪ R, partitions {R1, ..., Rk} of R \ C, and separators S1, ..., Sk ⊂ C.
Recurrence
[Figure: the root clique C covers S and splits the rest of R into parts R1, R2, R3, each attached to C through a separator S_i ⊂ C]

    score(T) = score(C) ∏_{i=1}^{k} score(T_i) / score(C ∩ C_i)

    f(S, R) = max_{S ⊂ C ⊆ S ∪ R} score(C) ∏_{i=1}^{k} f(S_i, R_i) / score(S_i),

with the maximum also taken over partitions {R1, ..., Rk} of R \ C and separators S1, ..., Sk ⊂ C.
Recurrence

    f(S, R) = max_{S ⊂ C ⊆ S ∪ R} score(C) ∏_{i=1}^{k} f(S_i, R_i) / score(S_i)

Since each separator S_i affects only its own factor, the maximization over S1, ..., Sk ⊂ C moves inside the product:

    f(S, R) = max_{S ⊂ C ⊆ S ∪ R} score(C) ∏_{i=1}^{k} max_{S_i ⊂ C} f(S_i, R_i) / score(S_i)

(in both forms, the maximum is also over partitions {R1, ..., Rk} of R \ C).
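A direct transcription of this recurrence can be sketched for tiny instances. It enumerates root cliques, partitions, and separators explicitly, so it is far slower than the paper's O(4^n) algorithm (which shares work across partitions), but it illustrates the structure; the score function and the test values below are hypothetical:

```python
from functools import lru_cache
from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def partitions(elems):
    # all partitions of a list into nonempty blocks
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # first in its own block
        for i in range(len(part)):                  # or joined to an existing block
            yield part[:i] + [part[i] + [first]] + part[i + 1:]

def best_score(V, score):
    """Maximum decomposable score over chordal graphs on V, by brute-force
    evaluation of the recurrence for f(S, R)."""
    @lru_cache(maxsize=None)
    def f(S, R):
        best = float('-inf')
        for extra in subsets(R):
            C = S | frozenset(extra)
            if not S < C:                           # S must be a proper subset of C
                continue
            for blocks in partitions(sorted(R - C)):
                total = score(C)
                for Ri in blocks:                   # one subtree per part of R \ C
                    total *= max(f(Si, frozenset(Ri)) / score(Si)
                                 for Si in map(frozenset, subsets(C)) if Si < C)
                best = max(best, total)
        return best
    return f(frozenset(), frozenset(V))
```

On two variables with score(∅) = 1, score({1}) = 2, score({2}) = 3, score({1,2}) = 10, the single edge wins (score 10); lowering score({1,2}) to 4 makes the empty graph optimal (score 2 · 3 = 6).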