Discrete Markov Random Fields: The Inference Story (Pradeep Ravikumar)

SLIDE 1

Discrete Markov Random Fields: The Inference Story. Pradeep Ravikumar

SLIDE 2

Graphical Models, The History

How to model stochastic processes of the world? I want to model the world, and I like graphs...

SLIDE 3

History

Mid to late twentieth century: pioneering work of conspiracy theorists. "The System, it is all connected..."

SLIDE 4

History

Late Twentieth Century: people realize that existing scientific literature offers a marriage between probability theory and graph theory – which can be used to model the world.

SLIDE 5

History

Common misconception: named after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.

SLIDE 6

History

Common misconception: named after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.

SLIDE 7

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

SLIDE 8

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

SLIDE 9

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

Separating Set ∼ (X2, X3) disconnects X1 and X4

SLIDE 10

Graphical Models

[Figure: graph over nodes X1, X2, X3, X4]

Separating Set ∼ (X2, X3) disconnects X1 and X4 Global Markov Property ∼ X1 ⊥ X4 | (X2, X3)

SLIDE 11

Graphical Models

MP(G) ∼ Set of all Markov properties, by ranging over separating sets of G. P represented by G ∼ P satisfies MP(G)

[Figure: graph G(X) and distributions P1(X), P2(X), P3(X), P4(X) represented by it]

SLIDE 12

Hammersley and Clifford Theorem

A positive P over X satisfies MP(G) iff P factorizes according to the cliques C of G:

P(X) = (1/Z) Π_C ψC(XC), the product running over cliques C of G.

A specific member of the family is specified by the weights (clique potentials).

[Figure: graph G(X) and distributions P1(X), ..., P4(X), each with its own clique potentials {ψC(XC)}]
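To make the factorization / Markov-property story concrete, here is a minimal numerical check in Python (not from the slides): it builds a chain-factorized distribution over four binary variables X1 - X2 - X3 - X4, an illustrative structure in which (X2, X3) separates X1 from X4, and verifies the implied conditional independence by brute-force enumeration. The potential tables are arbitrary positive values.

```python
import itertools
import numpy as np

rng = np.random.RandomState(0)
k = 2                                              # binary variables
psi12, psi23, psi34 = (np.exp(rng.randn(k, k)) for _ in range(3))

# Joint P(x) proportional to psi12(x1,x2) * psi23(x2,x3) * psi34(x3,x4)
P = np.zeros((k, k, k, k))
for x in itertools.product(range(k), repeat=4):
    P[x] = psi12[x[0], x[1]] * psi23[x[1], x[2]] * psi34[x[2], x[3]]
P /= P.sum()

# Global Markov property: X1 independent of X4 given (X2, X3)
for x2, x3 in itertools.product(range(k), repeat=2):
    block = P[:, x2, x3, :]                        # joint of (x1, x4) at fixed (x2, x3)
    block = block / block.sum()
    # conditional joint should equal the product of conditional marginals
    assert np.allclose(block, np.outer(block.sum(axis=1), block.sum(axis=0)))
print("X1 is conditionally independent of X4 given (X2, X3) under the clique factorization")
```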

SLIDE 13

Exponential Family

p(X) = (1/Z) Π_C ψC(XC) = exp( Σ_C log ψC(XC) − log Z )

Exponential family: p(X; θ) = exp( Σ_α θα φα(X) − Ψ(θ) )

  • {φα} ∼ features
  • {θα} ∼ parameters
  • Ψ(θ) ∼ log partition function

SLIDE 14

Inference

Answering queries about the graphical model probability distribution.

SLIDE 15

Inference

For an undirected model p(x; θ) = exp( Σ_{α∈I} θα φα(x) − Ψ(θ) ), the key inference problems are:

⊲ compute the log partition function (normalization constant) Ψ(θ)
⊲ marginals p(xA) = Σ_{xv, v∉A} p(x)
⊲ most probable configurations x∗ = arg max_x p(x | xL)

These problems are intractable in full generality.

SLIDE 16

Log Partition Function

log Z = log Σ_x Π_{α∈I} ψα(xα)

SLIDE 17

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

Σ_x Π_α ψα(xα)
  = Σ_{x∖i} Σ_{xi} Π_α ψα(xα)
  = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · Σ_{xi} Π_{α∈Ci} ψα(xα)
  = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Here Ci denotes the factors that involve xi, C∖i the remaining factors, and x∖i the variables other than xi.

SLIDE 18

Variable Elimination

Z = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Continue to "eliminate" the other variables xj. Is this a linear-time method then?
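As a sketch of this elimination recipe (assumptions: a simple chain of pairwise factors with arbitrary positive potential tables, chosen for illustration), the following Python snippet computes Z both by brute-force enumeration and by folding one variable at a time into a message g, as described above.

```python
import itertools
import numpy as np

rng = np.random.RandomState(1)
n, k = 6, 3                                                # 6 variables, 3 states each
psis = [np.exp(rng.randn(k, k)) for _ in range(n - 1)]     # chain factors psi_{i,i+1}(x_i, x_{i+1})

# Brute force: sum over all k^n configurations (exponential in n)
Z_brute = sum(
    np.prod([psis[i][x[i], x[i + 1]] for i in range(n - 1)])
    for x in itertools.product(range(k), repeat=n)
)

# Elimination: sum out x1 first, producing a message g(x2); then x2, and so on
g = np.ones(k)
for psi in psis:
    g = psi.T @ g          # g_new(x_{i+1}) = sum_{x_i} psi(x_i, x_{i+1}) g(x_i)
Z_elim = g.sum()

print(Z_brute, Z_elim)     # agree up to floating point; elimination costs O(n k^2)
```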

SLIDE 19

Variable Elimination

Z = Σ_{x∖i} Π_{α∈C∖i} ψα(xα) · g(x∖i)

Continue to "eliminate" the other variables xj. Is this a linear-time method then? Not in general: g(x∖i) depends on the variables xj that share a factor with xi.

SLIDE 20

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

SLIDE 21

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

SLIDE 22

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: after eliminating x1 through x4, only x5, x6, x7 remain, carrying messages m5, m6 and factors ψ57, ψ67]

SLIDE 23

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα)

[Figure: after eliminating x1 through x4, only x5, x6, x7 remain, carrying messages m5, m6 and factors ψ57, ψ67]

SLIDE 24

Variable Elimination

log Z = log Σ_x Π_{α∈I} ψα(xα). The cost is exponential in the tree-width.

[Figure: tree over x1, ..., x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ46, ψ36]

Z = Σ_{x7} [ Σ_{x5} ψ57 (Σ_{x1} ψ15)(Σ_{x2} ψ25) ] [ Σ_{x6} ψ67 (Σ_{x3} ψ36)(Σ_{x4} ψ46) ]

SLIDE 25

Inference

p(x; θ) = exp(θ⊤φ(x) − A(θ)) A(θ) ∼ log partition function

SLIDE 26

Inference

A(θ) = log Σ_x exp(θ⊤φ(x))

A(θ) ≤ B(θ, λ),  A(θ) ≥ C(θ, λ),  λ ∼ "variational" parameter

A(θ) ≤ inf_λ B(θ, λ)
A(θ) ≥ sup_λ C(θ, λ)

Summing over configurations → Optimization!

SLIDE 27

Inference

But... (there’s always a but!) is there a principled way to obtain “parametrized” bounds B(θ, λ) and C(θ, λ)?

SLIDE 28

Fenchel Duality

f(x) ∼ concave function. Define f∗(λ) = min_x { λ⊤x − f(x) }.

⇒ f′(xλ) = λ: the slope of the tangent at xλ is λ
Tangent ∼ λ⊤x − (Intercept)  ⇒  f(xλ) = λ⊤xλ − (Intercept)
⇒ f∗(λ) ∼ intercept of the line with slope λ that is tangent to f(x).

SLIDE 29

Fenchel Duality

Tangent ∼ λ⊤x − f∗(λ)

f(x) = min_λ { λ⊤x − f∗(λ) }

Thus, g(x, λ) = λ⊤x − f∗(λ) is an upper bound on f(x)!
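A tiny numerical illustration of this tangent-line bound, using the concave function f(x) = −x² as an assumed example (not from the slides): every line λx − f∗(λ) lies above f, and minimizing over λ recovers f.

```python
import numpy as np

def f(x):
    return -x ** 2          # a concave example function (assumption for illustration)

def f_star(lam, grid):
    # the slide's conjugate for concave f: f*(lam) = min_x { lam * x - f(x) }
    return np.min(lam * grid - f(grid))

grid = np.linspace(-5.0, 5.0, 2001)
lams = np.linspace(-4.0, 4.0, 81)
for x in np.linspace(-1.5, 1.5, 7):
    tangents = np.array([lam * x - f_star(lam, grid) for lam in lams])
    assert np.all(tangents >= f(x) - 1e-4)          # every tangent line upper-bounds f
    assert abs(tangents.min() - f(x)) < 1e-2        # minimizing over lam recovers f(x)
print("g(x, lam) = lam*x - f*(lam) upper-bounds f, with equality at the optimal lam")
```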

SLIDE 30

Fenchel Duality

Let us apply Fenchel duality to the log partition function! A(θ) is convex, so

A∗(µ) = sup_θ ( θ⊤µ − A(θ) )
A(θ) = sup_µ ( θ⊤µ − A∗(µ) )

SLIDE 31

Log Partition Function

Define the "marginal polytope" M = { µ ∈ R^d | ∃ p(·) s.t. Σ_x φ(x) p(x) = µ }

M ∼ convex hull of { φ(x) }

SLIDE 32

Mean parameter mapping

Consider the mapping Λ : Θ → M,  Λ(θ) := Eθ[φ(x)] = Σ_x φ(x) p(x; θ).

The mapping associates θ with the "mean parameters" µ := Λ(θ) ∈ M. Conversely, for µ ∈ Int(M), there exists θ = Λ⁻¹(µ) (unique if the exponential family is minimal).

SLIDE 33

Partition function conjugates

A∗(µ) = sup_θ ( θ⊤µ − A(θ) )
A(θ) = sup_µ ( θ⊤µ − A∗(µ) )

The optimal parameters are given by θµ = Λ⁻¹(µ) and µθ = Λ(θ).

SLIDE 34

Partition function conjugate

Properties of the Fenchel conjugate A∗(µ) = sup_θ ( θ⊤µ − A(θ) ):

⊲ A∗(µ) is finite only for µ ∈ M.
⊲ A∗(µ) is the negative entropy of the graphical model distribution with "mean parameters" µ, or equivalently with parameters Λ⁻¹(µ)!
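A single-Bernoulli sanity check of these conjugacy claims (an illustrative example, not from the slides): with φ(x) = x we have A(θ) = log(1 + e^θ), the mean map is the sigmoid, and the numerically computed conjugate matches the negative entropy µ log µ + (1 − µ) log(1 − µ).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative single-Bernoulli model: phi(x) = x, x in {0, 1}
def A(theta):
    return np.logaddexp(0.0, theta)               # log(1 + e^theta)

def neg_entropy(mu):
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)

def A_star_numeric(mu):
    # sup_theta ( theta * mu - A(theta) ), computed numerically
    res = minimize_scalar(lambda th: -(th * mu - A(th)), bounds=(-30.0, 30.0), method="bounded")
    return -res.fun

mu = 0.3
print(A_star_numeric(mu), neg_entropy(mu))        # A*(mu) equals the negative entropy

theta = 1.2
mu_theta = 1.0 / (1.0 + np.exp(-theta))           # mean map Lambda(theta) = sigmoid(theta)
print(A(theta), theta * mu_theta - neg_entropy(mu_theta))   # A(theta) = sup_mu (theta*mu - A*(mu))
```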

SLIDE 35

Partition Function

A(θ) = sup_{µ∈M} ( θ⊤µ − A∗(µ) )

"Hardness" is due to two bottlenecks:

  • M: a polytope with exponentially many vertices and no compact representation
  • A∗(µ): entropy computation

Approximate either or both!

SLIDE 36

Pairwise Graphical Models

[Figure: two nodes x1, x2 joined by the edge potential θ12 φ12(x1, x2)]

Overcomplete potentials:

Ij(xs) = 1 if xs = j, 0 otherwise
Ij,k(xs, xt) = 1 if xs = j and xt = k, 0 otherwise

p(x | θ) = exp( Σ_{s,j} θs;j Ij(xs) + Σ_{s,t;j,k} θs,t;j,k Ij,k(xs, xt) − Ψ(θ) )

SLIDE 37

Overcomplete Representation; Mean Parameters

µs;j := Eθ[Ij(xs)] = p(xs = j; θ)
µs,t;j,k := Eθ[Ij,k(xs, xt)] = p(xs = j, xt = k; θ)

Mean parameters are marginals! Define the following functional forms:

µs(xs) = Σ_j µs;j Ij(xs)
µst(xs, xt) = Σ_{j,k} µs,t;j,k Ij,k(xs, xt)
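The sketch below builds the overcomplete indicator representation for a tiny pairwise model and checks, by enumeration, that the mean parameters Eθ[Ij(xs)] are exactly the node marginals. The graph (a 3-node chain) and the parameter values are assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.RandomState(2)
n, k = 3, 2
edges = [(0, 1), (1, 2)]
theta_s = rng.randn(n, k)                          # theta_{s;j}
theta_st = {e: rng.randn(k, k) for e in edges}     # theta_{s,j;t,l}

def score(x):
    # theta^T phi(x) in the overcomplete indicator representation
    val = sum(theta_s[s, x[s]] for s in range(n))
    val += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
    return val

configs = list(itertools.product(range(k), repeat=n))
logits = np.array([score(x) for x in configs])
p = np.exp(logits - logits.max())
p /= p.sum()                                       # p(x; theta) by enumeration

# Mean parameter mu_{s;j} = E_theta[I_j(x_s)] equals the marginal p(x_s = j)
s, j = 1, 0
mu_sj = sum(p[i] for i, x in enumerate(configs) if x[s] == j)
print("mu_{1;0} =", mu_sj, "which is p(x_1 = 0)")
```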

SLIDE 38

Outer Polytope Approximations

LOCAL(G) := { µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs) }

SLIDE 39

Inner Polytope Approximations

For the given graph G and a subgraph H, let E(H) = { θ′ | θ′st = θst · 1[(s,t)∈H] }

M(G; H) = { µ | µ = Eθ[φ(x)] for some θ ∈ E(H) },  M(G; H) ⊆ M(G)

SLIDE 40

Entropy Approximations

Tree-structured distributions:

p(x; µ) = Π_s µs(xs) · Π_{(s,t)∈E} [ µst(xs, xt) / (µs(xs) µt(xt)) ]

Define:

Hs(µs) := − Σ_{xs} µs(xs) log µs(xs)
Ist(µst) := Σ_{xs,xt} µst(xs, xt) log [ µst(xs, xt) / (µs(xs) µt(xt)) ]

SLIDE 41

Tree-structured Entropy

A∗_tree(µ) = − Σ_{s∈V} Hs(µs) + Σ_{(s,t)∈E} Ist(µst)

Compact representation; can be used as an approximation.
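A short helper that evaluates A∗_tree(µ) from node and edge pseudomarginals, under the assumption that all entries are strictly positive (a sketch with hypothetical argument names, not from the slides).

```python
import numpy as np

def tree_A_star(node_marg, edge_marg, edges):
    """A*_tree(mu) = - sum_s Hs(mu_s) + sum_{(s,t)} Ist(mu_st).

    node_marg[s]     : (k_s,) marginal mu_s, strictly positive
    edge_marg[(s,t)] : (k_s, k_t) joint mu_st, indexed [x_s, x_t], strictly positive
    """
    H = sum(-(m * np.log(m)).sum() for m in node_marg)        # sum of node entropies
    I = 0.0
    for (s, t) in edges:
        m_st = edge_marg[(s, t)]
        outer = np.outer(node_marg[s], node_marg[t])
        I += (m_st * np.log(m_st / outer)).sum()              # mutual information on the edge
    return -H + I

# Sanity check on two independent binary nodes: Ist = 0, so A* is just -sum_s Hs
mu = [np.array([0.4, 0.6]), np.array([0.7, 0.3])]
mu_e = {(0, 1): np.outer(mu[0], mu[1])}
print(tree_A_star(mu, mu_e, [(0, 1)]))
```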

SLIDE 42

Approximate Inference Techniques

Belief Propagation – polytope ∼ LOCAL(G), entropy ∼ tree-structured entropy! (a sum-product sketch follows below)
Structured Mean Field – polytope ∼ M(G; H), entropy ∼ H-structured entropy
Mean Field – H = H0, the completely disconnected graph (all variables independent)
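Below is a minimal sum-product (loopy) belief propagation sketch for a pairwise MRF in numpy. The data layout (dicts of potential tables keyed by edge), the damping, and the iteration count are assumptions for illustration; on a tree the fixed point gives the exact node marginals, while on loopy graphs it corresponds to the LOCAL(G) / tree-entropy approximation above.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=50, damping=0.5):
    """Sum-product BP. unary[s]: (k,) node potentials; pairwise[(s,t)]: (k,k) edge
    potentials indexed [x_s, x_t]; returns approximate node marginals (beliefs)."""
    n = len(unary)
    neighbors = {s: [] for s in range(n)}
    msgs = {}
    for (s, t) in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)
        msgs[(s, t)] = np.ones(pairwise[(s, t)].shape[1]) / pairwise[(s, t)].shape[1]
        msgs[(t, s)] = np.ones(pairwise[(s, t)].shape[0]) / pairwise[(s, t)].shape[0]

    def edge_pot(s, t):
        # return the table indexed as [x_s, x_t] regardless of storage order
        return pairwise[(s, t)] if (s, t) in pairwise else pairwise[(t, s)].T

    for _ in range(n_iters):
        new_msgs = {}
        for (s, t) in msgs:
            # product of incoming messages at s, excluding the one from t
            prod = unary[s].copy()
            for u in neighbors[s]:
                if u != t:
                    prod *= msgs[(u, s)]
            m = edge_pot(s, t).T @ prod            # sum over x_s
            m /= m.sum()
            new_msgs[(s, t)] = damping * msgs[(s, t)] + (1 - damping) * m
        msgs = new_msgs

    beliefs = []
    for s in range(n):
        b = unary[s].copy()
        for u in neighbors[s]:
            b *= msgs[(u, s)]
        beliefs.append(b / b.sum())
    return beliefs

# Example: a 3-node chain with 2 labels per node (illustrative potentials)
unary = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
pairwise = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(loopy_bp(unary, pairwise, edges=[(0, 1), (1, 2)]))
```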

SLIDE 43

Divergence Measure View

Given: p(x; θ) ∝ exp(θ⊤φ(x)). We would like a more "manageable" surrogate distribution q ∈ Q:

min_{q∈Q} D( q(x) || p(x; θ) )

SLIDE 44

Divergence Measure View

min_{q∈Q} D( q(x) || p(x; θ) )

D(q||p) = KL(q||p) ∼ Structured Mean Field, Belief Propagation
D(p||q) = KL(p||q) ∼ Expectation Propagation (look out for the talk on Continuous Markov Random Fields!)

Typically the KL measure is approximated with "energy approximations" (Bethe free energy, Kikuchi free energy).

(Ravikumar, Lafferty 05; Preconditioner Approximations) Optimizing for a minimax criterion reduces the task to a generalized linear systems problem!

SLIDE 46

Bounds on event probabilities

Doctor: So what is the lower bound on the diagnosis probability?
Graphical Model: I don't know, but here is an "approximate" value.
Doctor: =(

Can we get upper and lower bounds on p(X ∈ C; θ) (instead of just "approximate" values)?

SLIDE 47

Bounds on event probabilities

Classical Chernoff Bounds give useful estimates for i.i.d. random variables. Can they be extended to graphical models? [Ravikumar, Lafferty 04; Variational Chernoff Bounds]

SLIDE 48

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 49

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 50

Classical Chernoff Bounds

Markov inequality: pθ(X ≥ u) ≤ Eθ(X) / u

pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]

From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )

Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.

SLIDE 51

Generalized Chernoff Bounds

Event: X ∈ C

IC(X) = 1 if X ∈ C, 0 otherwise

If IC(X) ≤ fλ, then pθ(X ∈ C) ≤ Eθ[fλ]

SLIDE 52

Generalized Chernoff Bounds

Event: X ∈ C

IC(X) = 1 if X ∈ C, 0 otherwise

If IC(X) ≤ fλ, then pθ(X ∈ C) ≤ Eθ[fλ]

SLIDE 53

Graphical Model Chernoff Bounds

With fλ = exp(⟨λ, x⟩ + u), we get:

log pθ(X ∈ C) ≤ inf_λ ( SC(−λ) + log Eθ[e^{⟨λ,x⟩}] )

where SC(λ) = sup_{x∈C} ⟨x, λ⟩ is the support function of the set C. For an exponential model with sufficient statistic φ(x), the above becomes:

log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ + λ) − Φ(θ) )

where Φ is the log partition function and SC,φ(λ) = sup_{x∈C} ⟨φ(x), λ⟩.

SLIDE 54

Graphical Model Chernoff Bounds

With fλ = exp(⟨λ, x⟩ + u), we get:

log pθ(X ∈ C) ≤ inf_λ ( SC(−λ) + log Eθ[e^{⟨λ,x⟩}] )

where SC(λ) = sup_{x∈C} ⟨x, λ⟩ is the support function of the set C. For an exponential model with sufficient statistic φ(x), the above becomes:

log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ + λ) − Φ(θ) )

where Φ is the log partition function and SC,φ(λ) = sup_{x∈C} ⟨φ(x), λ⟩.

SLIDE 55

MAP Estimation

[Figure: MRF over x1, x2, x3, x4]

SLIDE 56

MAP Estimation

[Figure: a configuration of (x1, x2, x3, x4)]

PROB = 0.01

SLIDE 57

MAP Estimation

[Figure: another configuration of (x1, x2, x3, x4)]

PROB = 0.2

SLIDE 58

MAP Estimation

[Figure: another configuration of (x1, x2, x3, x4)]

PROB = 0.2

Most Probable Configuration?

SLIDE 59

Polytope View

µ∗ = max_x θ⊤φ(x) = sup_{µ∈M} θ⊤µ

SLIDE 60

Outer Polytope Relaxations

LOCAL(G) := { µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs) }

SLIDE 61

Outer Polytope Relaxations

sup_{µ∈M(G)} θ⊤µ ≤ sup_{µ∈LOCAL(G)} θ⊤µ

A Linear Program!

(Chekuri, Khanna, Naor, Zosin 05; LP Formulation for Metric Labeling)
(Wainwright, Jaakkola, Willsky 05; Tree-reweighted Max-Product, Dual of the LP)
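To make the LOCAL(G) linear program concrete, here is a minimal sketch for a single-edge model with three labels per node, using scipy.optimize.linprog (the instance, variable ordering, and parameter values are assumptions for illustration). Since a single edge is a tree, LOCAL(G) coincides with M(G) here, so the LP value matches the exact MAP value found by enumeration.

```python
import numpy as np
from scipy.optimize import linprog

k = 3
rng = np.random.RandomState(0)
theta_0, theta_1 = rng.randn(k), rng.randn(k)      # theta_{s;j}
theta_01 = rng.randn(k, k)                         # theta_{0,j;1,l}

# Variable ordering: mu_0(0..k-1), mu_1(0..k-1), then mu_01(j,l) row-major
n_vars = 2 * k + k * k
c = -np.concatenate([theta_0, theta_1, theta_01.ravel()])   # linprog minimizes, so negate

A_eq, b_eq = [], []
for s in range(2):                                  # normalization: sum_j mu_s(j) = 1
    row = np.zeros(n_vars)
    row[s * k:(s + 1) * k] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for j in range(k):                                  # sum_l mu_01(j,l) = mu_0(j)
    row = np.zeros(n_vars)
    row[2 * k + j * k: 2 * k + (j + 1) * k] = 1.0
    row[j] = -1.0
    A_eq.append(row); b_eq.append(0.0)
for l in range(k):                                  # sum_j mu_01(j,l) = mu_1(l)
    row = np.zeros(n_vars)
    row[2 * k + l::k] = 1.0
    row[k + l] = -1.0
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * n_vars, method="highs")
lp_value = -res.fun

# Exact MAP value by enumeration (the LP is tight because a single edge is a tree)
exact = max(theta_0[j] + theta_1[l] + theta_01[j, l] for j in range(k) for l in range(k))
print(lp_value, exact)
```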

SLIDE 62

Inner Polytope Approximations

If MI ⊂ M is any subset of the marginal polytope that includes all of its vertices, then

µ∗ = max_x ⟨θ, φ(x)⟩ = sup_{µ∈MI} ⟨θ, µ⟩

SLIDE 63

Inner Polytope Approximations

For the given graph G and a subgraph H, let E(H) = { θ′ | θ′st = θst · 1[(s,t)∈H] }

M(G; H) = { µ | µ = Eθ[φ(x)] for some θ ∈ E(H) },  M(G; H) ⊆ M(G)

SLIDE 64

Inner Polytope Approximations

Mean field parameters:

M(G; H0) = { µ(s; j), µ(s, j; t, k) | 0 ≤ µ(s; j) ≤ 1,  µ(s, j; t, k) = µ(s; j) µ(t; k) }

Mean field relaxation:

sup_{µ∈M(G;H0)} ⟨θ, µ⟩
  = sup_{µ∈M(G;H0)} Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s, j; t, k)
  = sup_{µ∈M(G;H0)} Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s; j) µ(t; k)

A Quadratic Program! (Ravikumar, Lafferty 06; Quadratic Relaxations for Metric Labeling and MAP in MRFs)
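A naive mean-field coordinate-ascent sketch for this fully factorized relaxation (the data layout and example values are assumptions for illustration; the update is the standard one, log q_s(j) ∝ θs;j + Σ_t Σ_k θs,j;t,k q_t(k)):

```python
import numpy as np

def naive_mean_field(theta_s, theta_st, edges, n_iters=100):
    """theta_s[s]: (k,) node parameters; theta_st[(s,t)]: (k,k) edge parameters
    indexed [x_s, x_t]; returns fully factorized pseudomarginals q[s] ~ mu(s; .)."""
    n = len(theta_s)
    q = [np.full_like(theta_s[s], 1.0 / len(theta_s[s])) for s in range(n)]

    neighbors = {s: [] for s in range(n)}
    for (s, t) in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)

    def edge_theta(s, t):
        # return the table indexed as [x_s, x_t] regardless of storage order
        return theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T

    for _ in range(n_iters):
        for s in range(n):
            logits = theta_s[s].copy()
            for t in neighbors[s]:
                logits += edge_theta(s, t) @ q[t]   # expected edge contribution
            logits -= logits.max()                  # numerical stability
            q[s] = np.exp(logits)
            q[s] /= q[s].sum()
    return q

# Example: two binary nodes with an attractive coupling (illustrative values)
theta_s = [np.array([0.2, -0.2]), np.array([-0.1, 0.1])]
theta_st = {(0, 1): np.array([[0.5, -0.5], [-0.5, 0.5]])}
print(naive_mean_field(theta_s, theta_st, edges=[(0, 1)]))
```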

SLIDE 65

References

⊲ Martin J. Wainwright and Michael I. Jordan (2003). Graphical models, exponential families, and variational inference.
⊲ M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models.
⊲ Pradeep Ravikumar and John Lafferty (2005). Preconditioner approximations for probabilistic graphical models.
⊲ Pradeep Ravikumar and John Lafferty (2004). Variational Chernoff bounds for graphical models.
⊲ C. Chekuri, S. Khanna, J. Naor, and L. Zosin (2005). A linear programming formulation and approximation algorithms for the metric labeling problem.
⊲ Pradeep Ravikumar and John Lafferty (2006). Quadratic programming relaxations for metric labeling and Markov random field MAP estimation.
