Graphical Models
10-715 Fall 2015
Alexander Smola, alex@smola.org
Office hours: after class in my office
Marianas Labs

Directed Graphical Models
Brain & Brawn
p(brain) = 0.1, p(sports) = 0.2
Conditional probability table for p(graduate = 1 | smart, strong):
                smart = 0   smart = 1
  strong = 0       0.1         0.8
  strong = 1       0.8         0.9

p(g, s, b) = p(g|s, b) p(s) p(b)
Step 1: the prior factorizes, p(s, b) = p(s) p(b), with p(brain) = 0.1 and p(sports) = 0.2:
                smart = 0   smart = 1
  strong = 0       0.72        0.08
  strong = 1       0.18        0.02
Step 2: element-wise multiply by p(g = 1 | s, b) to obtain p(g = 1, s, b):
                smart = 0   smart = 1
  strong = 0       0.072       0.064
  strong = 1       0.144       0.018

p(s, b|g) = p(s) p(b) p(g|s, b) / Σ_{s', b'} p(s') p(b') p(g|s', b')
Step 3: renormalize to 1, yielding p(s, b | g = 1):
                smart = 0   smart = 1
  strong = 0       0.242       0.215
  strong = 1       0.483       0.060
Explaining away:
p(brain) = 0.1                        p(sports) = 0.2
p(brain|graduate) = 0.275             p(sports|graduate) = 0.544
p(brain|graduate, sports) = 0.111     p(sports|graduate, brain) = 0.220
p(brain|graduate, nosports) = 0.471   p(sports|graduate, nobrain) = 0.333
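These tables can be reproduced in a few lines of Octave (a sketch in the same style as the sessions later in the slides; rows index strong = 0/1, columns smart = 0/1):

ps  = [0.8; 0.2];              % p(strong = 0), p(strong = 1)
pb  = [0.9; 0.1];              % p(smart = 0),  p(smart = 1)
pg1 = [0.1 0.8; 0.8 0.9];      % p(graduate = 1 | strong, smart)
prior = ps * pb'               % p(s, b) = p(s) p(b)
joint = prior .* pg1           % element-wise multiply
post  = joint / sum(joint(:))  % renormalize: p(s, b | g = 1)
sum(post(:, 2))                % p(brain | graduate),  about 0.275
sum(post(2, :))                % p(sports | graduate), about 0.544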
Common cause instead: graduate as parent of smart and strong.
p(g, s, b) = p(g) p(s|g) p(b|g)
p(s, b) = Σ_g p(s|g) p(b|g) p(g)   (dependent in general)
p(s, b|g) = p(s|g) p(b|g)          (conditionally independent given g)
MySQL, Apache, Website
a and m are dependent conditioned on w:
p(m, a, w) = p(w|m, a) p(m) p(a)
p(m, a|w) = p(w|m, a) p(m) p(a) / Σ_{m', a'} p(w|m', a') p(m') p(a')
If the website is working, then MySQL is working and Apache is working.
If the website is broken, at least one of the two services is broken (not independent).
(Diagram: the collider m -> w <- a, extended with a user action node u.)
Cause and effect: we may factor either way, p(c|e) p(e) or p(e|c) p(c).
Either choice yields a valid distribution; the graph specifies children given parents.
Example DAG on nodes 1 ... 9:
p(x) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3, x7) p(x5|x2, x3, x6) p(x6|x9) p(x7|x6) p(x8|x5) p(x9)
In general:
p(x) = ∏_i p(x_i | x_parents(i))
log p(x|θ) = Σ_i log p(x_i | x_parents(i), θ)
Learning decomposes: minimize over each θ_i
−log p(x_i | x_parents(i), θ_i) − log p(θ_i)
… don't worry, there's math why this works …
p(x) = ∏_i p(x_i | x_parents(i))
With missing data, set q(x_missing) = p(x_missing | x_observed) and minimize over θ_i
E_{x_missing ∼ q}[−log p(x_i | x_parents(i), θ_i)] − log p(θ_i)
Independent variables become dependent when conditioned on a joint child.
An observed parent makes its children independent.
p(x) = ∏_i p(x_i | x_parents(i))

Chain a -> b -> c:
p(a, b, c) = p(a) p(b|a) p(c|b)
p(a, c|b) = p(a) p(b|a) p(c|b) / Σ_{a', c'} p(a') p(b|a') p(c'|b)
          = [p(a) p(b|a) / Σ_{a'} p(a') p(b|a')] · [p(c|b) / Σ_{c'} p(c'|b)]
hence a ⊥ c | b: independence
Common parent a <- b -> c:
p(a, b, c) = p(a|b) p(b) p(c|b)
p(a, c) = Σ_b p(a|b) p(b) p(c|b)   dependence marginally
p(a, c|b) = p(a|b) p(c|b)          a ⊥ c | b
Collider a -> b <- c:
p(a, b, c) = p(a) p(b|a, c) p(c)
p(a, c|b) = p(a) p(b|a, c) p(c) / Σ_{a', c'} p(a') p(b|a', c') p(c')
(Figures: the three canonical three-node graphs over X, Y, Z and which pairs are conditionally independent given the third. Courtesy of Sam Roweis.)
Bayes Ball
Query: x2 ⊥ x3 | {x1, x6}?
(Graph on x1 ... x6; trace where the ball can travel.)
Answer: independent.
Plates
Explicitly: x1, x2, x3, x4 drawn i.i.d. given Θ. Plate notation collapses the repeated nodes into a single x_i inside a plate.
p(X, θ) = p(θ) ∏_i p(x_i|θ)
Adding labels y_i with weights w:
p(X, Y, θ, w) = p(θ) p(w) ∏_i p(x_i|θ) p(y_i|x_i, w)
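As a concrete sketch (my choice of a Bernoulli likelihood with a Beta prior; all numbers made up), the plate factorization turns the log-joint into a prior term plus a sum of per-observation terms:

% log p(X, theta) = log p(theta) + sum_i log p(x_i | theta)
% assumed model: x_i ~ Bernoulli(theta), theta ~ Beta(a, b)
a = 2; b = 2; x = [1 0 1 1]; theta = 0.6;
logprior = (a-1)*log(theta) + (b-1)*log(1-theta) - betaln(a, b);
loglik   = sum(x*log(theta) + (1-x)*log(1-theta));
logjoint = logprior + loglik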
Markov Chain
(Chain: past, present, future. Given the present, past and future are independent.)
Hidden Markov Chain
Observed: user action; latent: the user's mindset.
Example: user model for traversal through search results.
p(x, y; θ) = p(x0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ) ∏_{i=1}^{n} p(y_i|x_i)
p(x; θ) = p(x0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ)
Plates: Latent Factors, Observed Effects
Observed effects: click behavior, queries, watched news, emails.
Latent factors: user profile, news content, hot keywords, social connectivity graph, events.
Example: Restricted Boltzmann Machine.
Latent factor model:
x ∼ N(Σ_{i=1}^{d} y_i v_i, σ² 1)   and   p(y) = ∏_{i=1}^{d} p(y_i)
(Plate diagram over r, u, m.)
... intersecting plates ... (like nested FOR loops)
Applications: news, SearchMonkey answers, social ranking, OMG, personals.
engineering
machine learning
Recall the i.i.d. plate model: p(X, θ) = p(θ) ∏_i p(x_i|θ).
Now chain the x_i together instead of assuming independence.
Transition Matrices
p(x0) = (0.4, 0.6)
Π_{0→1} = [0.2 0.1; 0.8 0.9],  Π_{1→2} = [0.8 0.5; 0.2 0.5],  Π_{2→3} = [0 1; 1 0]

p(x; θ) = p(x0; θ) ∏_{i=1}^{n−1} p(x_{i+1}|x_i; θ)   on the chain x0 → x1 → x2 → x3

Unraveling the chain:
p(x1) = Σ_{x0} p(x1|x0) p(x0)   ⇔   π1 = Π_{0→1} π0
p(x2) = Σ_{x1} p(x2|x1) p(x1)   ⇔   π2 = Π_{1→2} π1 = Π_{1→2} Π_{0→1} π0
Forward recursion:
p(x_i|x1) = Σ_{x_j : 1<j<i} [∏_{l=2}^{i−1} p(x_{l+1}|x_l)] · l2(x2)   where l2(x2) := p(x2|x1)
          = Σ_{x_j : 2<j<i} [∏_{l=3}^{i−1} p(x_{l+1}|x_l)] · l3(x3)   where l3(x3) := Σ_{x2} p(x3|x2) l2(x2)
          = Σ_{x_j : 3<j<i} [∏_{l=4}^{i−1} p(x_{l+1}|x_l)] · l4(x4)   where l4(x4) := Σ_{x3} p(x4|x3) l3(x3)
Unraveling the chain in Octave:

x0  = [0.4; 0.6];
Pi1 = [0.2 0.1; 0.8 0.9];
Pi2 = [0.8 0.5; 0.2 0.5];
Pi3 = [0 1; 1 0];
x3  = Pi3 * Pi2 * Pi1 * x0    % = [0.45800; 0.54200]
Backward recursion:
p(x1|x_n) ∝ Σ_{x_j : 1<j<n} [∏_{l=1}^{n−1} p(x_{l+1}|x_l)] · r_n(x_n)   where r_n(x_n) := 1
          = Σ_{x_j : 1<j<n−1} [∏_{l=1}^{n−2} p(x_{l+1}|x_l)] · r_{n−1}(x_{n−1})   where r_{n−1}(x_{n−1}) := Σ_{x_n} p(x_n|x_{n−1}) r_n(x_n)
          = Σ_{x_j : 1<j<n−2} [∏_{l=1}^{n−3} p(x_{l+1}|x_l)] · r_{n−2}(x_{n−2})   where r_{n−2}(x_{n−2}) := Σ_{x_{n−1}} p(x_{n−1}|x_{n−2}) r_{n−1}(x_{n−1})
Normalize in the end.
Example on the chain x0 → ... → x5:
p(x0 = t) = p(x0 = b) = 0.5; observed at Tazza d'oro at step 5, so p(x5 = t) = 1.
Transition matrix Π = [0.9 0.2; 0.1 0.8] (stay in the current state with high probability).
> Pi = [0.9 0.2; 0.1 0.8]
Pi =
   0.90000   0.20000
   0.10000   0.80000
> l1 = [0.5; 0.5];
> l3 = Pi * Pi * l1
l3 =
   0.58500
   0.41500
> r5 = [1; 0];
> r3 = Pi' * Pi' * r5
r3 =
   0.83000
   0.34000
> (l3 .* r3) / sum(l3 .* r3)
ans =
   0.77483
   0.22517
Messages on a chain:
m_{i+1→i}(x_i) = Σ_{x_{i+1}} m_{i+2→i+1}(x_{i+1}) f(x_i, x_{i+1})
m_{i−1→i}(x_i) = Σ_{x_{i−1}} m_{i−2→i−1}(x_{i−1}) f(x_{i−1}, x_i)
In matrix form: l_i = Π_i l_{i−1}   and   r_i = Π_i^T r_{i+1}
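Each recursion is a single loop of matrix-vector products; a sketch reusing the transition matrices from the earlier example (any chain of Π_i would do):

Pis = {[0.2 0.1; 0.8 0.9], [0.8 0.5; 0.2 0.5], [0 1; 1 0]};
l = [0.4; 0.6];            % l_0 = p(x0)
for i = 1:numel(Pis)
  l = Pis{i} * l;          % l_i = Pi_i l_{i-1}
end
r = [1; 1];                % r_n = 1
for i = numel(Pis):-1:1
  r = Pis{i}' * r;         % r_i = Pi_i' r_{i+1}
end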
First-order chain x0 → x1 → x2 → x3:
p(X) = p(x0) ∏_i p(x_{i+1}|x_i)
Second-order chain:
p(X) = p(x0, x1) ∏_i p(x_{i+1}|x_i, x_{i−1})
Trees: chain x0 → ... → x5 with a branch x2 → x6 → x7 → x8.
Messages, computed once each:
l1(x1) = Σ_{x0} p(x0) p(x1|x0)          r7(x7) = Σ_{x8} p(x8|x7)
l2(x2) = Σ_{x1} l1(x1) p(x2|x1)         r6(x6) = Σ_{x7} r7(x7) p(x7|x6)
r2(x2) = Σ_{x6} r6(x6) p(x6|x2)
l3(x3) = Σ_{x2} l2(x2) p(x3|x2) r2(x2)  ...
(The direction of edges only matters for parametrization.)
Message passing at a node with several neighbors, e.g. node 2 with neighbors 1, 3, 4:
m_{2→3}(x3) = Σ_{x2} m_{1→2}(x2) m_{4→2}(x2) f(x2, x3)
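Once the incoming messages are multiplied element-wise, a message is just a matrix-vector product; a two-state Octave sketch with made-up numbers:

m12 = [0.6; 0.4];  m42 = [0.7; 0.3];   % assumed incoming messages at node 2
F23 = [0.9 0.1; 0.2 0.8];              % assumed potential, F23(x2, x3)
m23 = F23' * (m12 .* m42)              % m_{2->3}: sum over x2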
On the tree x0 ... x8 (branch at x2):
m_{2→3}(x3) = Σ_{x2} m_{1→2}(x2) m_{6→2}(x2) f(x2, x3)
m_{2→6}(x6) = Σ_{x2} m_{1→2}(x2) m_{3→2}(x2) f(x2, x6)
m_{2→1}(x1) = Σ_{x2} m_{3→2}(x2) m_{6→2}(x2) f(x1, x2)
Messages integrate signals from all sources.
Hidden Markov Model
(Diagram: a chain of states with one observation per state; plate over i = 1..m.)
p(x, y) = p(y1) [∏_{i=1}^{m−1} p(y_{i+1}|y_i) p(x_i|y_i)] p(x_m|y_m)
Marginals are obtained by summing over all other variables x_j.
p(x|y) = p(x, y) / Σ_{x'} p(x', y)   and   p(x_i|y) ∝ Σ_{x_j : j<i} Σ_{x_j : j>i} p(x, y)
Dynamic programming (sum-product):
l1(x1) = p(x1)
l_{j+1}(x_{j+1}) = Σ_{x_j} l_j(x_j) p(x_{j+1}|x_j) p(y_j|x_j)
r_n(x_n) = 1
r_{j−1}(x_{j−1}) = Σ_{x_j} r_j(x_j) p(y_j|x_j) p(x_j|x_{j−1})
p(x_i|rest) ∝ l_i(x_i) r_i(x_i) p(y_i|x_i)

The same algorithm finds the most likely values: replace (+, ×) by (max, +).
l1(x1) = log p(x1)
l_{j+1}(x_{j+1}) = max_{x_j} [l_j(x_j) + log p(x_{j+1}|x_j) + log p(y_j|x_j)]
r_n(x_n) = 0
r_{j−1}(x_{j−1}) = max_{x_j} [r_j(x_j) + log p(y_j|x_j) + log p(x_j|x_{j−1})]
x̂_i = argmax_{x_i} [l_i(x_i) + r_i(x_i) + log p(y_i|x_i)]
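A minimal (max, +) sketch in Octave, written as a single forward pass with backtracking rather than the two-pass l/r form above; all numbers are assumed toy values, with logA(a, b) = log p(x_{j+1} = a | x_j = b) and logE(y, a) = log p(y | x = a):

logp0 = log([0.5; 0.5]);           % log p(x1)
logA  = log([0.9 0.2; 0.1 0.8]);   % transitions (assumed)
logE  = log([0.8 0.3; 0.2 0.7]);   % emissions (assumed)
y = [1 1 2 2];  m = numel(y);
l = zeros(2, m);  back = zeros(2, m);
l(:, 1) = logp0 + logE(y(1), :)';
for j = 1:m-1
  [best, idx] = max(l(:, j)' + logA, [], 2);  % maximize over x_j
  l(:, j+1) = best + logE(y(j+1), :)';
  back(:, j+1) = idx;
end
[~, s] = max(l(:, m));  xhat = zeros(1, m);  xhat(m) = s;
for j = m-1:-1:1
  xhat(j) = back(xhat(j+1), j+1);             % trace back the argmax
end
xhat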
Why not maximize the log-likelihood directly? It is nonconvex!
p(x, y) = p(x1) [∏_{i=1}^{m−1} p(x_{i+1}|x_i) p(y_i|x_i)] p(y_m|x_m)
Exact probability computation is infeasible, so lower-bound the log-likelihood:
log p(y; θ) ≥ ∫ dq(x) log p(x, y; θ) − ∫ dq(x) log q(x)
Choose a chain-structured q(x) = q(x1) ∏_{i=2}^{m} q(x_i|x_{i−1}).
Dynamic programming yields the chain quantities q(x1), q(x_{i+1}|x_i), q(x_i);
the bound is tight for q(x) = p(x|y).
p(x, y) = p(x1) [∏_{i=1}^{m−1} p(x_{i+1}|x_i) p(y_i|x_i)] p(y_m|x_m)
Since we have set p(x1) = q(x1), the initial-state term is already maximized.
M-step: same as clustering, e.g. for Gaussians.
E_{x∼q}[log p(x, y; θ)] = E_{x1∼q}[log p(x1; θ)]
  + Σ_{i=1}^{m} E_{xi∼q}[log p(y_i|x_i; θ)]
  + Σ_{i=1}^{m−1} E_{(xi,xi+1)∼q}[log p(x_{i+1}|x_i; θ)]
Initial state: maximize E_{q(x1)}[log p(x1)].
Gaussian emissions, with n_x = Σ_i q_i(x):
μ_x = (1/n_x) Σ_{i=1}^{m} q_i(x) y_i
Σ_x = (1/n_x) Σ_{i=1}^{m} q_i(x) y_i y_i^T − μ_x μ_x^T
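A sketch of the emission update with assumed responsibilities q_i(x) and two-dimensional observations (all numbers made up):

q = [0.9 0.2 0.1; 0.1 0.8 0.9];   % q(x, i) = q_i(x) for states x = 1, 2; m = 3
y = [1.0 2.0 3.0; 0.5 1.5 2.5];   % observations y_i as columns
for x = 1:rows(q)
  nx       = sum(q(x, :));                                     % effective sample size n_x
  MU(:, x) = y * q(x, :)' / nx;                                % mu_x
  SIG{x}   = (y .* q(x, :)) * y' / nx - MU(:, x) * MU(:, x)';  % Sigma_x
end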
Transitions: the expected counts act as an effective sample.
Maximizing Σ_{i=1}^{m−1} q_i(a, b) log p(a|b) yields p(a|b) = Σ_{i=1}^{m−1} q_i(a, b) / Σ_{i=1}^{m−1} q_i(b).
Adding a transition smoother to the aggregate mass:
p(a|b) = [n_{a|b} + Σ_{i=1}^{m−1} q_i(a, b)] / [n_b + Σ_{i=1}^{m−1} q_i(b)]
(Diagram: HMM over x1 ... xm with observations y1 ... ym, shared parameters Θ, and per-state Gaussians μ1, Σ1, ..., μk, Σk.)
Inference beyond trees:
Junction Tree
Loopy Belief Propagation
Variational inference: a simpler distribution without loops
Sampling: draw from one variable at a time
(Plot: a multimodal posterior density over a parameter, with the modes and the mean marked.)
p(θ|X) ∝ p(X|θ) p(θ); the MAP estimate comes from maximizing it.
Gibbs sampling: alternate x ∼ p(x|x') and then x' ∼ p(x'|x).
Example joint over two variables with values {g, b}:
        g      b
  g    0.45   0.05
  b    0.05   0.45
Trace: (b,g), draw p(·,g) → (g,g), draw p(g,·) → (g,g), draw p(·,g) → (b,g), draw p(b,·) → (b,b) ...
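A Gibbs sampler for this 2x2 joint in Octave (a sketch; states coded 1 = g, 2 = b):

p = [0.45 0.05; 0.05 0.45];
T = 100000;  s = [2 1];               % start at (b, g)
counts = zeros(2);
for t = 1:T
  c = p(:, s(2)) / sum(p(:, s(2)));   % p(first | second)
  s(1) = 1 + (rand() > c(1));
  c = p(s(1), :) / sum(p(s(1), :));   % p(second | first)
  s(2) = 1 + (rand() > c(1));
  counts(s(1), s(2)) += 1;
end
counts / T                            % approaches the joint table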
Gibbs sampling for clustering:
random initialization, sample cluster labels, resample cluster model, resample cluster labels, resample cluster model, ...
e.g. Mahout Dirichlet Process Clustering
Chicken and egg: we know that chicken and egg are correlated, but not whether chicken causes egg or egg causes chicken.
Encode the correlation via the clique potential between c and e:
p(c, e) ∝ exp ψ(c, e)
p(c, e) = exp ψ(c, e) / Σ_{c', e'} exp ψ(c', e') = exp[ψ(c, e) − g(ψ)]
where g(ψ) = log Σ_{c,e} exp ψ(c, e)
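Normalization is a single log-sum-exp; a two-state sketch with an assumed potential that favors agreement:

psi = [1.2 -0.8; -0.8 1.0];   % assumed clique potential psi(c, e)
g   = log(sum(exp(psi(:))))   % g(psi) = log sum_{c,e} exp psi(c, e)
P   = exp(psi - g)            % joint p(c, e), sums to 1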
MySQL, Apache, Website revisited
Directed: p(w|m, a) p(m) p(a), where m and a are dependent given w.
Undirected: the site affects MySQL and the site affects Apache,
p(m, w, a) ∝ φ(m, w) φ(w, a), where m ⊥ a | w.
Trade-off: easier "debugging" vs. easier "modeling".
Key Concept: observing nodes makes the remainder conditionally independent.
p(x) = ∏_c ψ_c(x_c)
If the density has full support, then it decomposes into a product of clique potentials (the Hammersley-Clifford theorem).
Reading off dependencies:
directed graphs: tricky (Bayes Ball algorithm)
undirected graphs: easy to read off (graph connectivity), but correlation only
Exponential family:
p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))   where   g(θ) = log Σ_{x'} exp(⟨φ(x'), θ⟩)
∂_θ g(θ) = E[φ(x)]   and   ∂²_θ g(θ) = Var[φ(x)]
Unconditional model:
p(x|θ) = e^{⟨φ(x), θ⟩ − g(θ)}   with   g(θ) = log Σ_x e^{⟨φ(x), θ⟩}
∂_θ g(θ) = Σ_x φ(x) e^{⟨φ(x), θ⟩} / Σ_x e^{⟨φ(x), θ⟩} = Σ_x φ(x) e^{⟨φ(x), θ⟩ − g(θ)} = E_x[φ(x)]

Conditional model:
p(y|θ, x) = e^{⟨φ(x, y), θ⟩ − g(θ|x)}   with   g(θ|x) = log Σ_y e^{⟨φ(x, y), θ⟩}
∂_θ g(θ|x) = Σ_y φ(x, y) e^{⟨φ(x, y), θ⟩} / Σ_y e^{⟨φ(x, y), θ⟩} = Σ_y φ(x, y) e^{⟨φ(x, y), θ⟩ − g(θ|x)} = E_{y|x}[φ(x, y)]
log p(y|x; θ) = ⟨φ(x, y), θ⟩ − g(θ|x)
log p(θ|X, Y) = Σ_i log p(y_i|x_i; θ) + log p(θ) + const.
             = ⟨Σ_i φ(x_i, y_i), θ⟩ − Σ_i g(θ|x_i) − (1/2σ²) ‖θ‖² + const.
Setting the derivative to zero:
Σ_i φ(x_i, y_i) = Σ_i E_{y|x_i}[φ(x_i, y)] + (1/σ²) θ
(empirical statistics = maxent model expectations + prior term; the expectations are expensive to compute)
Logistic regression: x → y with
φ(x, y) = y φ(x)   where   y ∈ {±1}
g(θ|x) = log[e^{+⟨φ(x), θ⟩} + e^{−⟨φ(x), θ⟩}] = log 2 cosh⟨φ(x), θ⟩
minimize over θ:
(1/2σ²) ‖θ‖² + Σ_i [log 2 cosh⟨φ(x_i), θ⟩ − y_i ⟨φ(x_i), θ⟩]
p(y|x, θ) = e^{y⟨φ(x), θ⟩} / (e^{⟨φ(x), θ⟩} + e^{−⟨φ(x), θ⟩}) = 1 / (1 + e^{−2y⟨φ(x), θ⟩})
Related: GP Classification
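The objective is smooth, so plain gradient descent suffices for a sketch; made-up data with φ(x) = x, using d/dθ log 2 cosh⟨φ(x), θ⟩ = tanh(⟨φ(x), θ⟩) φ(x):

X = [1 2; -1 -2; 2 1; -2 -1];  y = [1; -1; 1; -1];   % assumed data
sigma2 = 1;  theta = zeros(2, 1);  eta = 0.1;
for t = 1:200
  z = X * theta;                               % <phi(x_i), theta>
  grad = theta / sigma2 + X' * (tanh(z) - y);  % gradient of the objective
  theta -= eta * grad;
end
theta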
p(x) = ∏_c ψ_c(x_c)
Theorem: the clique decomposition also holds in the sufficient statistics:
φ(x) = (..., φ_c(x_c), ...)   and   ⟨φ(x), θ⟩ = Σ_c ⟨φ_c(x_c), θ_c⟩
Corollary: we only need expectations on cliques:
E_x[φ(x)] = (..., E_{x_c}[φ_c(x_c)], ...)
Conditional random field over a chain: observations x, labels y
φ(x, y) = (y_1 φ_x(x_1), ..., y_n φ_x(x_n), φ_y(y_1, y_2), ..., φ_y(y_{n−1}, y_n))
⟨φ(x, y), θ⟩ = Σ_i ⟨φ_x(x_i, y_i), θ_x⟩ + Σ_i ⟨φ_y(y_i, y_{i+1}), θ_y⟩
g(θ|x) = log Σ_y ∏_i f_i(y_i, y_{i+1})   where   f_i(y_i, y_{i+1}) = e^{⟨φ_x(x_i, y_i), θ_x⟩ + ⟨φ_y(y_i, y_{i+1}), θ_y⟩}
Computed by dynamic programming, via message passing ...
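The log-partition function g(θ|x) follows from one forward sweep over the factors f_i; a two-label Octave sketch with assumed factor tables:

f = {[1.0 0.3; 0.3 1.0], [0.9 0.5; 0.2 1.1]};   % assumed f_i(y_i, y_{i+1})
l = ones(2, 1);                                 % l_1(y_1) = 1
for i = 1:numel(f)
  l = f{i}' * l;                                % sum out y_i
end
g = log(sum(l))                                 % g(theta|x)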
Chain without observations:
p(x) = ∏_i ψ_i(x_i, x_{i+1})
Chain with observations:
p(x, y) = ∏_i ψ_i^x(x_i, x_{i+1}) ψ_i^xy(x_i, y_i)
p(x|y) ∝ ∏_i ψ_i^x(x_i, x_{i+1}) ψ_i^xy(x_i, y_i) =: ∏_i f_i(x_i, x_{i+1})
Dynamic Programming:
l1(x1) = 1   and   l_{i+1}(x_{i+1}) = Σ_{x_i} l_i(x_i) f_i(x_i, x_{i+1})
r_n(x_n) = 1   and   r_i(x_i) = Σ_{x_{i+1}} r_{i+1}(x_{i+1}) f_i(x_i, x_{i+1})
Document labels:
p(y|x) = ∏_i ψ(y_i, y_parent(i), x)
Images: latent labels x, real image y, on a 2D grid:
p(x|y) = ∏_{ij} ψ_right(x_{ij}, x_{i+1,j}) ψ_up(x_{ij}, x_{i,j+1}) ψ_xy(x_{ij}, y_{ij})
Long-range interactions: Li & Huttenlocher, ECCV'08
Applications:
classification: CRF, SMM
phrase segmentation, activity recognition, motion data analysis (Shi, Smola, Altun, Vishwanathan, Li, 2007-2009)
web page information extraction, segmentation, annotation (Bo, Zhu, Nie, Wen, Hon, 2005-2007)