Discrete Markov Random Fields: The Inference Story (Pradeep Ravikumar)

SLIDE 1

Discrete Markov Random Fields: The Inference Story. Pradeep Ravikumar

SLIDE 2

Graphical Models, The History

How to model stochastic processes of the world? I want to model the world, and I like graphs...

SLIDE 3

History

Mid to late twentieth century: pioneering work of conspiracy theorists. "The System, it is all connected..."

SLIDE 4

History

Late Twentieth Century: people realize that existing scientific literature offers a marriage between probability theory and graph theory – which can be used to model the world.

SLIDE 5

History

Common misconception: named after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.

SLIDE 6

History

Common misconception: named after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.

SLIDE 7

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

SLIDE 8

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

SLIDE 9

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

Separating Set ∼ (X2, X3) disconnects X1 and X4

SLIDE 10

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

Separating Set ∼ (X2, X3) disconnects X1 and X4 Global Markov Property ∼ X1 ⊥ X4 | (X2, X3)

SLIDE 11

Graphical Models

MP(G) ∼ Set of all Markov properties, by ranging over separating sets of G. P represented by G ∼ P satisfies MP(G)

[Figure: graph G(X) and distributions P1(X), P2(X), P3(X), P4(X) represented by it]

SLIDE 12

Hammersley and Clifford Theorem

A positive P over X satisfies MP(G) iff P factorizes according to the cliques C of G:

P(X) = (1/Z) Π_C ψC(XC), the product running over cliques C of G.

A specific member of the family is specified by the weights (clique potentials).

[Figure: graph G(X) and distributions P1(X), ..., P4(X), each with its own clique potentials {ψC(XC)}]
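To make the factorization / Markov-property story concrete, here is a minimal numerical check in Python (not from the slides): it builds a chain-factorized distribution over four binary variables X1 - X2 - X3 - X4, an illustrative structure in which (X2, X3) separates X1 from X4, and verifies the implied conditional independence by brute-force enumeration. The potential tables are arbitrary positive values.

```python
import itertools
import numpy as np

rng = np.random.RandomState(0)
k = 2                                              # binary variables
psi12, psi23, psi34 = (np.exp(rng.randn(k, k)) for _ in range(3))

# Joint P(x) proportional to psi12(x1,x2) * psi23(x2,x3) * psi34(x3,x4)
P = np.zeros((k, k, k, k))
for x in itertools.product(range(k), repeat=4):
    P[x] = psi12[x[0], x[1]] * psi23[x[1], x[2]] * psi34[x[2], x[3]]
P /= P.sum()

# Global Markov property: X1 independent of X4 given (X2, X3)
for x2, x3 in itertools.product(range(k), repeat=2):
    block = P[:, x2, x3, :]                        # joint of (x1, x4) at fixed (x2, x3)
    block = block / block.sum()
    # conditional joint should equal the product of conditional marginals
    assert np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0)))
print("X1 is conditionally independent of X4 given (X2, X3) under the clique factorization")
```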

SLIDE 13

Exponential Family

p(X) = (1/Z) Π_C ψC(XC) = exp( Σ_C log ψC(XC) − log Z )

Exponential family: p(X; θ) = exp( Σ_α θα φα(X) − Ψ(θ) )

  • {φα} ∼ features
  • {θα} ∼ parameters
  • Ψ(θ) ∼ log partition function

SLIDE 14

Inference

Answering queries about the graphical model probability distribution.

SLIDE 15

Inference

For an undirected model p(x; θ) = exp( Σ_{α∈I} θα φα(x) − Ψ(θ) ), the key inference problems are:

⊲ compute the log partition function (normalization constant) Ψ(θ)
⊲ marginals p(xA) = Σ_{xv, v∉A} p(x)
⊲ most probable configurations x∗ = arg max_x p(x | xL)

These problems are intractable in full generality.

SLIDE 16

Log Partition Function

log Z = log Σ_x Π_{α∈I} ψα(xα)

SLIDE 17

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

Σ_x Π_α ψα(xα)
  = Σ_{x∖i} Σ_{xi} Π_α ψα(xα)
  = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · Σ_{xi} Π_{α∈Ci} ψα(xα)
  = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Here Ci denotes the factors that involve xi, C∖i the remaining factors, and x∖i the variables other than xi.

SLIDE 18

Variable Elimination

Z = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Continue to "eliminate" the other variables xj. Is this a linear-time method then?
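As a sketch of this elimination recipe (assumptions: a simple chain of pairwise factors with arbitrary positive potential tables, chosen for illustration), the following Python snippet computes Z both by brute-force enumeration and by folding one variable at a time into a message g, as described above.

```python
import itertools
import numpy as np

rng = np.random.RandomState(1)
n, k = 6, 3                                                # 6 variables, 3 states each
psis = [np.exp(rng.randn(k, k)) for _ in range(n - 1)]     # chain factors psi_{i,i+1}(x_i, x_{i+1})

# Brute force: sum over all k^n configurations (exponential in n)
Z_brute = sum(
    np.prod([psis[i][x[i], x[i + 1]] for i in range(n - 1)])
    for x in itertools.product(range(k), repeat=n)
)

# Elimination: sum out x1 first, producing a message g(x2); then x2, and so on
g = np.ones(k)
for psi in psis:
    g = psi.T @ g          # g_new(x_{i+1}) = sum_{x_i} psi(x_i, x_{i+1}) g(x_i)
Z_elim = g.sum()

print(Z_brute, Z_elim)     # agree up to floating point; elimination costs O(n k^2)
```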

SLIDE 19

Variable Elimination

Z = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Continue to "eliminate" the other variables xj. Is this a linear-time method then? Not in general: g(x∖i) depends on the variables xj that share a factor with xi.

SLIDE 20

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

SLIDE 21

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

SLIDE 22

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: after eliminating x1 through x4, only x5, x6, x7 remain, carrying messages m5, m6 and factors ψ57, ψ67]

SLIDE 23

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: after eliminating x1 through x4, only x5, x6, x7 remain, carrying messages m5, m6 and factors ψ57, ψ67]

SLIDE 24

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα). The cost is exponential in the tree-width.

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

Z = Σ_{x7} [ Σ_{x5} ψ57 (Σ_{x1} ψ15)(Σ_{x2} ψ25) ] [ Σ_{x6} ψ67 (Σ_{x3} ψ36)(Σ_{x4} ψ46) ]

SLIDE 25

Inference

p(x; θ) = exp(θ⊤φ(x) − A(θ)) A(θ) ∼ log partition function

SLIDE 26

Inference

A(θ) = log Σ_x exp(θ⊤φ(x))

A(θ) ≤ B(θ, λ),  A(θ) ≥ C(θ, λ),  λ ∼ "variational" parameter

A(θ) ≤ inf_λ B(θ, λ)
A(θ) ≥ sup_λ C(θ, λ)

Summing over configurations → Optimization!

SLIDE 27

Inference

But... (there’s always a but!) is there a principled way to obtain “parametrized” bounds B(θ, λ) and C(θ, λ)?

SLIDE 28

Fenchel Duality

f(x) ∼ concave function. Define f∗(λ) = min_x { λ⊤x − f(x) }.

⇒ f′(xλ) = λ: the slope of the tangent at xλ is λ
Tangent ∼ λ⊤x − (Intercept)  ⇒  f(xλ) = λ⊤xλ − (Intercept)
⇒ f∗(λ) ∼ intercept of the line with slope λ that is tangent to f(x).

SLIDE 29

Fenchel Duality

Tangent ∼ λ⊤x − f∗(λ)

f(x) = min_λ { λ⊤x − f∗(λ) }

Thus, g(x, λ) = λ⊤x − f∗(λ) is an upper bound on f(x)!
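A tiny numerical illustration of this tangent-line bound, using the concave function f(x) = −x² as an assumed example (not from the slides): every line λx − f∗(λ) lies above f, and minimizing over λ recovers f.

```python
import numpy as np

def f(x):
    return -x ** 2          # a concave example function (assumption for illustration)

def f_star(lam, grid):
    # the slide's conjugate for concave f: f*(lam) = min_x { lam * x - f(x) }
    return np.min(lam * grid - f(grid))

grid = np.linspace(-5.0, 5.0, 2001)
lams = np.linspace(-4.0, 4.0, 81)
for x in np.linspace(-1.5, 1.5, 7):
    tangents = np.array([lam * x - f_star(lam, grid) for lam in lams])
    assert np.all(tangents >= f(x) - 1e-4)          # every tangent line upper-bounds f
    assert abs(tangents.min() - f(x)) < 1e-2        # minimizing over lam recovers f(x)
print("g(x, lam) = lam*x - f*(lam) upper-bounds f, with equality at the optimal lam")
```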

SLIDE 30

Fenchel Duality

Let us apply Fenchel duality to the log partition function! A(θ) is convex, so

A∗(µ) = sup_θ ( θ⊤µ − A(θ) )
A(θ) = sup_µ ( θ⊤µ − A∗(µ) )

SLIDE 31

Log Partition Function

Define the "marginal polytope" M = { µ ∈ R^d | ∃ p(·) s.t. Σ_x φ(x) p(x) = µ }

M ∼ convex hull of { φ(x) }

SLIDE 32

Mean parameter mapping

Consider the mapping Λ : Θ → M,  Λ(θ) := Eθ[φ(x)] = Σ_x φ(x) p(x; θ).

The mapping associates θ with the "mean parameters" µ := Λ(θ) ∈ M. Conversely, for µ ∈ Int(M), there exists θ = Λ⁻¹(µ) (unique if the exponential family is minimal).

SLIDE 33

Partition function conjugates

A∗(µ) = sup_θ ( θ⊤µ − A(θ) )
A(θ) = sup_µ ( θ⊤µ − A∗(µ) )

The optimal parameters are given by θµ = Λ⁻¹(µ) and µθ = Λ(θ).

SLIDE 34

Partition function conjugate

Properties of the Fenchel conjugate A∗(µ) = sup_θ ( θ⊤µ − A(θ) ):

⊲ A∗(µ) is finite only for µ ∈ M.
⊲ A∗(µ) is the negative entropy of the graphical model distribution with "mean parameters" µ, or equivalently with parameters Λ⁻¹(µ)!
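A single-Bernoulli sanity check of these conjugacy claims (an illustrative example, not from the slides): with φ(x) = x we have A(θ) = log(1 + e^θ), the mean map is the sigmoid, and the numerically computed conjugate matches the negative entropy µ log µ + (1 − µ) log(1 − µ).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative single-Bernoulli model: phi(x) = x, x in {0, 1}
def A(theta):
    return np.logaddexp(0.0, theta)               # log(1 + e^theta)

def neg_entropy(mu):
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)

def A_star_numeric(mu):
    # sup_theta ( theta * mu - A(theta) ), computed numerically
    res = minimize_scalar(lambda th: -(th * mu - A(th)), bounds=(-30.0, 30.0), method="bounded")
    return -res.fun

mu = 0.3
print(A_star_numeric(mu), neg_entropy(mu))        # A*(mu) equals the negative entropy

theta = 1.2
mu_theta = 1.0 / (1.0 + np.exp(-theta))           # mean map Lambda(theta) = sigmoid(theta)
print(A(theta), theta * mu_theta - neg_entropy(mu_theta))   # A(theta) = sup_mu (theta*mu - A*(mu))
```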

SLIDE 35

Partition Function

A(θ) = sup_{µ∈M} ( θ⊤µ − A∗(µ) )

"Hardness" is due to two bottlenecks:

  • M: a polytope with exponentially many vertices and no compact representation
  • A∗(µ): entropy computation

Approximate either or both!

SLIDE 36

Pairwise Graphical Models

[Figure: two nodes x1, x2 joined by the edge potential θ12 φ12(x1, x2)]

Overcomplete potentials:

Ij(xs) = 1 if xs = j, 0 otherwise
Ij,k(xs, xt) = 1 if xs = j and xt = k, 0 otherwise

p(x | θ) = exp( Σ_{s,j} θs;j Ij(xs) + Σ_{s,t;j,k} θs,t;j,k Ij,k(xs, xt) − Ψ(θ) )

SLIDE 37

Overcomplete Representation; Mean Parameters

µs;j := Eθ[Ij(xs)] = p(xs = j; θ)
µs,t;j,k := Eθ[Ij,k(xs, xt)] = p(xs = j, xt = k; θ)

Mean parameters are marginals! Define the following functional forms:

µs(xs) = Σ_j µs;j Ij(xs)
µst(xs, xt) = Σ_{j,k} µs,t;j,k Ij,k(xs, xt)
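The sketch below builds the overcomplete indicator representation for a tiny pairwise model and checks, by enumeration, that the mean parameters Eθ[Ij(xs)] are exactly the node marginals. The graph (a 3-node chain) and the parameter values are assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.RandomState(2)
n, k = 3, 2
edges = [(0, 1), (1, 2)]
theta_s = rng.randn(n, k)                          # theta_{s;j}
theta_st = {e: rng.randn(k, k) for e in edges}     # theta_{s,j;t,l}

def score(x):
    # theta^T phi(x) in the overcomplete indicator representation
    val = sum(theta_s[s, x[s]] for s in range(n))
    val += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
    return val

configs = list(itertools.product(range(k), repeat=n))
logits = np.array([score(x) for x in configs])
p = np.exp(logits - logits.max())
p /= p.sum()                                       # p(x; theta) by enumeration

# Mean parameter mu_{s;j} = E_theta[I_j(x_s)] equals the marginal p(x_s = j)
s, j = 1, 0
mu_sj = sum(p[i] for i, x in enumerate(configs) if x[s] == j)
print("mu_{1;0} =", mu_sj, "which is p(x_1 = 0)")
```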

SLIDE 38

Outer Polytope Approximations

LOCAL(G) := { µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs) }

SLIDE 39

Inner Polytope Approximations

For the given graph G and a subgraph H, let E(H) = { θ′ | θ′st = θst · 1[(s,t)∈H] }

M(G; H) = { µ | µ = Eθ[φ(x)] for some θ ∈ E(H) },  M(G; H) ⊆ M(G)

SLIDE 40

Entropy Approximations

Tree-structured distributions:

p(x; µ) = Π_s µs(xs) · Π_{(s,t)∈E} [ µst(xs, xt) / (µs(xs) µt(xt)) ]

Define:

Hs(µs) := − Σ_{xs} µs(xs) log µs(xs)
Ist(µst) := Σ_{xs,xt} µst(xs, xt) log [ µst(xs, xt) / (µs(xs) µt(xt)) ]

SLIDE 41

Tree-structured Entropy

A∗_tree(µ) = − Σ_{s∈V} Hs(µs) + Σ_{(s,t)∈E} Ist(µst)

Compact representation; can be used as an approximation.
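A short helper that evaluates A∗_tree(µ) from node and edge pseudomarginals, under the assumption that all entries are strictly positive (a sketch with hypothetical argument names, not from the slides).

```python
import numpy as np

def tree_A_star(node_marg, edge_marg, edges):
    """A*_tree(mu) = - sum_s Hs(mu_s) + sum_{(s,t)} Ist(mu_st).

    node_marg[s]     : (k_s,) marginal mu_s, strictly positive
    edge_marg[(s,t)] : (k_s, k_t) joint mu_st, indexed [x_s, x_t], strictly positive
    """
    H = sum(-(m * np.log(m)).sum() for m in node_marg)        # sum of node entropies
    I = 0.0
    for (s, t) in edges:
        m_st = edge_marg[(s, t)]
        outer = np.outer(node_marg[s], node_marg[t])
        I += (m_st * np.log(m_st / outer)).sum()              # mutual information on the edge
    return -H + I

# Sanity check on two independent binary nodes: Ist = 0, so A* is just -sum_s Hs
mu = [np.array([0.4, 0.6]), np.array([0.7, 0.3])]
mu_e = {(0, 1): np.outer(mu[0], mu[1])}
print(tree_A_star(mu, mu_e, [(0, 1)]))
```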

SLIDE 42

Approximate Inference Techniques

Belief Propagation – polytope ∼ LOCAL(G), entropy ∼ tree-structured entropy! (a sum-product sketch follows below)
Structured Mean Field – polytope ∼ M(G; H), entropy ∼ H-structured entropy
Mean Field – H = H0, the completely disconnected graph (all variables independent)
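Below is a minimal sum-product (loopy) belief propagation sketch for a pairwise MRF in numpy. The data layout (dicts of potential tables keyed by edge), the damping, and the iteration count are assumptions for illustration; on a tree the fixed point gives the exact node marginals, while on loopy graphs it corresponds to the LOCAL(G) / tree-entropy approximation above.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=50, damping=0.5):
    """Sum-product BP. unary[s]: (k,) node potentials; pairwise[(s,t)]: (k,k) edge
    potentials indexed [x_s, x_t]; returns approximate node marginals (beliefs)."""
    n = len(unary)
    neighbors = {s: [] for s in range(n)}
    msgs = {}
    for (s, t) in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)
        msgs[(s, t)] = np.ones(pairwise[(s, t)].shape[1]) / pairwise[(s, t)].shape[1]
        msgs[(t, s)] = np.ones(pairwise[(s, t)].shape[0]) / pairwise[(s, t)].shape[0]

    def edge_pot(s, t):
        # return the table indexed as [x_s, x_t] regardless of storage order
        return pairwise[(s, t)] if (s, t) in pairwise else pairwise[(t, s)].T

    for _ in range(n_iters):
        new_msgs = {}
        for (s, t) in msgs:
            # product of incoming messages at s, excluding the one from t
            prod = unary[s].copy()
            for u in neighbors[s]:
                if u != t:
                    prod *= msgs[(u, s)]
            m = edge_pot(s, t).T @ prod            # sum over x_s
            m /= m.sum()
            new_msgs[(s, t)] = damping * msgs[(s, t)] + (1 - damping) * m
        msgs = new_msgs

    beliefs = []
    for s in range(n):
        b = unary[s].copy()
        for u in neighbors[s]:
            b *= msgs[(u, s)]
        beliefs.append(b / b.sum())
    return beliefs

# Example: a 3-node chain with 2 labels per node (illustrative potentials)
unary = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
pairwise = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(loopy_bp(unary, pairwise, edges=[(0, 1), (1, 2)]))
```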

SLIDE 43

Divergence Measure View

Given: p(x; θ) ∝ exp(θ⊤φ(x)). We would like a more "manageable" surrogate distribution q ∈ Q:

min_{q∈Q} D( q(x) || p(x; θ) )

SLIDE 44

Divergence Measure View

min_{q∈Q} D( q(x) || p(x; θ) )

D(q||p) = KL(q||p) ∼ Structured Mean Field, Belief Propagation
D(p||q) = KL(p||q) ∼ Expectation Propagation (look out for the talk on Continuous Markov Random Fields!)

Typically the KL measure is approximated with "energy approximations" (Bethe free energy, Kikuchi free energy).

(Ravikumar, Lafferty 05; Preconditioner Approximations) Optimizing for a minimax criterion reduces the task to a generalized linear systems problem!

SLIDE 46

Bounds on event probabilities

Doctor: So what is the lower bound on the diagnosis probability?
Graphical Model: I don't know, but here is an "approximate" value.
Doctor: =(

Can we get upper and lower bounds on p(X ∈ C; θ) (instead of just "approximate" values)?

SLIDE 47

Bounds on event probabilities

Classical Chernoff Bounds give useful estimates for i.i.d. random variables. Can they be extended to graphical models? [Ravikumar, Lafferty 04; Variational Chernoff Bounds]

SLIDE 48

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 49

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 50

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 51

Generalized Chernoff Bounds

Event: X ∈ C

IC(X) = 1 if X ∈ C, 0 otherwise

If IC(X) ≤ fλ, then pθ(X ∈ C) ≤ Eθ[fλ]

SLIDE 52

Generalized Chernoff Bounds

Event: X ∈ C

IC(X) = 1 if X ∈ C, 0 otherwise

If IC(X) ≤ fλ, then pθ(X ∈ C) ≤ Eθ[fλ]

SLIDE 53

Graphical Model Chernoff Bounds

With fλ = exp(⟨λ, x⟩ + u), we get:

log pθ(X ∈ C) ≤ inf_λ ( SC(−λ) + log Eθ[e^{⟨λ,x⟩}] )

where SC(λ) = sup_{x∈C} ⟨x, λ⟩ is the support function of the set C. For an exponential model with sufficient statistic φ(x), the above becomes:

log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ + λ) − Φ(θ) )

where Φ is the log partition function and SC,φ(λ) = sup_{x∈C} ⟨φ(x), λ⟩.

SLIDE 54

Graphical Model Chernoff Bounds

With fλ = exp(⟨λ, x⟩ + u), we get:

log pθ(X ∈ C) ≤ inf_λ ( SC(−λ) + log Eθ[e^{⟨λ,x⟩}] )

where SC(λ) = sup_{x∈C} ⟨x, λ⟩ is the support function of the set C. For an exponential model with sufficient statistic φ(x), the above becomes:

log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ + λ) − Φ(θ) )

where Φ is the log partition function and SC,φ(λ) = sup_{x∈C} ⟨φ(x), λ⟩.

SLIDE 55

MAP Estimation

[Figure: MRF over x1, x2, x3, x4]

SLIDE 56

MAP Estimation

[Figure: a configuration of (x1, x2, x3, x4)]

PROB = 0.01

SLIDE 57

MAP Estimation

[Figure: another configuration of (x1, x2, x3, x4)]

PROB = 0.2

SLIDE 58

MAP Estimation

[Figure: another configuration of (x1, x2, x3, x4)]

PROB = 0.2

Most Probable Configuration?

SLIDE 59

Polytope View

µ∗ = max_x θ⊤φ(x) = sup_{µ∈M} θ⊤µ

SLIDE 60

Outer Polytope Relaxations

LOCAL(G) := { µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs) }

SLIDE 61

Outer Polytope Relaxations

sup_{µ∈M(G)} θ⊤µ ≤ sup_{µ∈LOCAL(G)} θ⊤µ

A Linear Program!

(Chekuri, Khanna, Naor, Zosin 05; LP Formulation for Metric Labeling)
(Wainwright, Jaakkola, Willsky 05; Tree-reweighted Max-Product, Dual of the LP)
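To make the LOCAL(G) linear program concrete, here is a minimal sketch for a single-edge model with three labels per node, using scipy.optimize.linprog (the instance, variable ordering, and parameter values are assumptions for illustration). Since a single edge is a tree, LOCAL(G) coincides with M(G) here, so the LP value matches the exact MAP value found by enumeration.

```python
import numpy as np
from scipy.optimize import linprog

k = 3
rng = np.random.RandomState(0)
theta_0, theta_1 = rng.randn(k), rng.randn(k)      # theta_{s;j}
theta_01 = rng.randn(k, k)                         # theta_{0,j;1,l}

# Variable ordering: mu_0(0..k-1), mu_1(0..k-1), then mu_01(j,l) row-major
n_vars = 2 * k + k * k
c = -np.concatenate([theta_0, theta_1, theta_01.ravel()])   # linprog minimizes, so negate

A_eq, b_eq = [], []
for s in range(2):                                  # normalization: sum_j mu_s(j) = 1
    row = np.zeros(n_vars)
    row[s * k:(s + 1) * k] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for j in range(k):                                  # sum_l mu_01(j,l) = mu_0(j)
    row = np.zeros(n_vars)
    row[2 * k + j * k: 2 * k + (j + 1) * k] = 1.0
    row[j] = -1.0
    A_eq.append(row); b_eq.append(0.0)
for l in range(k):                                  # sum_j mu_01(j,l) = mu_1(l)
    row = np.zeros(n_vars)
    row[2 * k + l::k] = 1.0
    row[k + l] = -1.0
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * n_vars, method="highs")
lp_value = -res.fun

# Exact MAP value by enumeration (the LP is tight because a single edge is a tree)
exact = max(theta_0[j] + theta_1[l] + theta_01[j, l] for j in range(k) for l in range(k))
print(lp_value, exact)
```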

SLIDE 62

Inner Polytope Approximations

If MI ⊂ M is any subset of the marginal polytope that includes all of its vertices, then

µ∗ = max_x ⟨θ, φ(x)⟩ = sup_{µ∈MI} ⟨θ, µ⟩

SLIDE 63

Inner Polytope Approximations

For the given graph G and a subgraph H, let E(H) = { θ′ | θ′st = θst · 1[(s,t)∈H] }

M(G; H) = { µ | µ = Eθ[φ(x)] for some θ ∈ E(H) },  M(G; H) ⊆ M(G)

SLIDE 64

Inner Polytope Approximations

Mean field parameters:

M(G; H0) = { µ(s; j), µ(s, j; t, k) | 0 ≤ µ(s; j) ≤ 1,  µ(s, j; t, k) = µ(s; j) µ(t; k) }

Mean field relaxation:

sup_{µ∈M(G;H0)} ⟨θ, µ⟩
  = sup_{µ∈M(G;H0)} Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s, j; t, k)
  = sup_{µ∈M(G;H0)} Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s; j) µ(t; k)

A Quadratic Program! (Ravikumar, Lafferty 06; Quadratic Relaxations for Metric Labeling and MAP in MRFs)
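A naive mean-field coordinate-ascent sketch for this fully factorized relaxation (the data layout and example values are assumptions for illustration; the update is the standard one, log q_s(j) ∝ θs;j + Σ_t Σ_k θs,j;t,k q_t(k)):

```python
import numpy as np

def naive_mean_field(theta_s, theta_st, edges, n_iters=100):
    """theta_s[s]: (k,) node parameters; theta_st[(s,t)]: (k,k) edge parameters
    indexed [x_s, x_t]; returns fully factorized pseudomarginals q[s] ~ mu(s; .)."""
    n = len(theta_s)
    q = [np.full_like(theta_s[s], 1.0 / len(theta_s[s])) for s in range(n)]

    neighbors = {s: [] for s in range(n)}
    for (s, t) in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)

    def edge_theta(s, t):
        # return the table indexed as [x_s, x_t] regardless of storage order
        return theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T

    for _ in range(n_iters):
        for s in range(n):
            logits = theta_s[s].copy()
            for t in neighbors[s]:
                logits += edge_theta(s, t) @ q[t]   # expected edge contribution
            logits -= logits.max()                  # numerical stability
            q[s] = np.exp(logits)
            q[s] /= q[s].sum()
    return q

# Example: two binary nodes with an attractive coupling (illustrative values)
theta_s = [np.array([0.2, -0.2]), np.array([-0.1, 0.1])]
theta_st = {(0, 1): np.array([[0.5, -0.5], [-0.5, 0.5]])}
print(naive_mean_field(theta_s, theta_st, edges=[(0, 1)]))
```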

SLIDE 65

References

⊲ Martin J. Wainwright and Michael I. Jordan (2003). Graphical models, exponential families, and variational inference.
⊲ M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models.
⊲ Pradeep Ravikumar and John Lafferty (2005). Preconditioner approximations for probabilistic graphical models.
⊲ Pradeep Ravikumar and John Lafferty (2004). Variational Chernoff bounds for graphical models.
⊲ C. Chekuri, S. Khanna, J. Naor, and L. Zosin (2005). A linear programming formulation and approximation algorithms for the metric labeling problem.
⊲ Pradeep Ravikumar and John Lafferty (2006). Quadratic programming relaxations for metric labeling and Markov random field MAP estimation.
