

slide-1
SLIDE 1

Marianas Labs

Graphical Models

10-715 Fall 2015

Alexander Smola alex@smola.org
 Office hours - after class in my office

slide-2
SLIDE 2

Directed Graphical Models

slide-3
SLIDE 3

Brain & Brawn

p(g=1 | s, b):          smart b=0   smart b=1
     strong s=0            0.1         0.8
     strong s=1            0.8         0.9

p(g, s, b) = p(g | s, b) p(s) p(b),   p(brain) = 0.1,   p(sports) = 0.2

slide-4
SLIDE 4

Brain & Brawn

p(g=1 | s, b):          smart b=0   smart b=1
     strong s=0            0.1         0.8
     strong s=1            0.8         0.9

p(g, s, b) = p(g | s, b) p(s) p(b)

p(s, b) = p(s) p(b):    smart b=0   smart b=1
     strong s=0            0.72        0.08
     strong s=1            0.18        0.02

p(brain) = 0.1,   p(sports) = 0.2

slide-5
SLIDE 5

Brain & Brawn

p(g=1 | s, b):          smart b=0   smart b=1
     strong s=0            0.1         0.8
     strong s=1            0.8         0.9

p(g, s, b) = p(g | s, b) p(s) p(b)

p(g=1, s, b) (element-wise multiply):
                        smart b=0   smart b=1
     strong s=0            0.072       0.064
     strong s=1            0.144       0.018

p(s, b | g) = p(s) p(b) p(g | s, b) / Σ_{s', b'} p(s') p(b') p(g | s', b')

p(brain) = 0.1,   p(sports) = 0.2

slide-6
SLIDE 6

Brain & Brawn

p(g=1 | s, b):          smart b=0   smart b=1
     strong s=0            0.1         0.8
     strong s=1            0.8         0.9

p(g, s, b) = p(g | s, b) p(s) p(b)

p(s, b | g=1) (renormalize to 1):
                        smart b=0   smart b=1
     strong s=0            0.242       0.215
     strong s=1            0.483       0.060

p(s, b | g) = p(s) p(b) p(g | s, b) / Σ_{s', b'} p(s') p(b') p(g | s', b')

p(brain) = 0.1,   p(sports) = 0.2

slide-7
SLIDE 7

Brain & Brawn

p(g, s, b) = p(g | s, b) p(s) p(b)

p(s, b | g=1):          smart b=0   smart b=1
     strong s=0            0.242       0.215
     strong s=1            0.483       0.060

p(s, b | g) = p(s) p(b) p(g | s, b) / Σ_{s', b'} p(s') p(b') p(g | s', b')

p(brain) = 0.1                            p(sports) = 0.2
p(brain | graduate) = 0.275               p(sports | graduate) = 0.544
p(brain | graduate, sports) = 0.111       p(sports | graduate, brain) = 0.220
p(brain | graduate, nosports) = 0.471     p(sports | graduate, nobrain) = 0.333
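The whole table can be reproduced in a few lines of Octave; a minimal sketch (the variable names pg, joint, post are mine, the numbers are the ones above):

pg = [0.1 0.8; 0.8 0.9];       % p(g=1 | s, b): rows strong s=0/1, cols smart b=0/1
ps = [0.8; 0.2];               % p(s): sports prior
pb = [0.9 0.1];                % p(b): brain prior
joint = ps * pb;               % p(s, b) = p(s) p(b): a priori independent
post  = pg .* joint;           % element-wise multiply: p(g=1, s, b)
post  = post / sum(post(:));   % renormalize to 1: p(s, b | g=1)
p_sports_given_g = sum(post(2, :))   % 0.544
p_brain_given_g  = sum(post(:, 2))   % 0.275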

slide-8
SLIDE 8

Brain & Brawn

p(g, s, b) = p(g) p(s | g) p(b | g)

p(s, b) = Σ_g p(s | g) p(b | g) p(g)

p(s, b | g) = p(s | g) p(b | g)

slide-9
SLIDE 9

... some Web 2.0 service

MySQL Apache Website

slide-10
SLIDE 10

... some Web 2.0 service

  • Joint distribution (assume a and m are independent)


MySQL Apache Website

slide-11
SLIDE 11

... some Web 2.0 service

  • Joint distribution (assume a and m are independent)

  • Explaining away



 
 a and m are dependent conditioned on w

MySQL Apache Website

p(m, a, w) = p(w | m, a) p(m) p(a)

p(m, a | w) = p(w | m, a) p(m) p(a) / Σ_{m', a'} p(w | m', a') p(m') p(a')

slide-12
SLIDE 12

... some Web 2.0 service

MySQL Apache Website

slide-13
SLIDE 13

... some Web 2.0 service

is working

MySQL is working Apache is working

MySQL Apache Website

slide-14
SLIDE 14

... some Web 2.0 service

is working

MySQL is working Apache is working

is broken

At least one of the two services is broken (not independent)

MySQL Apache Website

slide-15
SLIDE 15

Directed graphical model

  • Easier estimation
  • 15 parameters for full joint distribution
  • 1+1+4+1 for factorizing distribution
  • Causal relations
  • Inference for unobserved variables

[graphs: (m, a) → w, repeated, with an added user-action node u in the last one]

slide-16
SLIDE 16

No loops allowed

slide-20
SLIDE 20

No loops allowed

p(c|e)p(e|c)

slide-21
SLIDE 21

No loops allowed

p(c|e)p(e) or p(e|c)p(c)

p(c|e)p(e|c)


slide-23
SLIDE 23

Directed Graphical Model

  • Probability distribution
  • Iterate over children | parents

[graph over vertices 1, ..., 9]

p(x) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3, x7) p(x5|x2, x3, x6) p(x6|x9) p(x7|x6) p(x8|x5) p(x9)

slide-24
SLIDE 24

Directed Graphical Model

  • Joint probability distribution


  • Parameter estimation
  • If x is fully observed the likelihood breaks up


  • Estimation is trivial. All terms decompose

p(x) = Π_i p(xi | x_parents(i))

log p(x | θ) = Σ_i log p(xi | x_parents(i), θ)

minimize_{θi}   − log p(xi | x_parents(i), θi) − log p(θi)
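To illustrate why fully observed estimation is trivial, here is a small Octave sketch on synthetic draws from the Brain & Brawn model of the earlier slides (the sample size and helper names are my choices): each conditional table is obtained by counting within the matching parent configuration.

N  = 100000;
s  = double(rand(N,1) < 0.2);                    % p(sports) = 0.2
b  = double(rand(N,1) < 0.1);                    % p(brain)  = 0.1
pg = [0.1 0.8; 0.8 0.9];                         % p(g=1 | s, b)
g  = double(rand(N,1) < pg(sub2ind([2 2], s+1, b+1)));
p_s_hat = mean(s);                               % estimate of p(s)
p_g_given_sb = zeros(2, 2);
for si = 0:1
  for bi = 0:1
    idx = (s == si) & (b == bi);
    p_g_given_sb(si+1, bi+1) = mean(g(idx));     % count-based conditional estimate
  end
end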

slide-25
SLIDE 25

Approximate Inference


… don’t worry, there’s math why this works …

  • Joint probability distribution


  • EM Parameter estimation
  • If x is partly observed we need to approximate
  • Intuition - use best guess of missing variables
  • Estimation is possible (M step)

p(x) = Π_i p(xi | x_parents(i))

q(x_missing) = p(x_missing | x_observed)

minimize_{θi}   E_{x_missing ∼ q}[ − log p(xi | x_parents(i), θi) ] − log p(θi)

slide-26
SLIDE 26

Summary

  • Directed graphical models


  • Explaining away


Independent variables become dependent conditioned on a joint child.

  • Observing yields independence


Observed parent makes children independent

  • No loops in graph allowed

p(x) = Π_i p(xi | x_parents(i))

slide-27
SLIDE 27

Dependence

slide-28
SLIDE 28

1 Chain

  • Joint distribution
  • Conditioning on b
  • Conditional independence

p(a, b, c) = p(a) p(b|a) p(c|b)

p(a, c | b) = p(a) p(b|a) p(c|b) / Σ_{a', c'} p(a') p(b|a') p(c'|b)
            = [ p(a) p(b|a) / Σ_{a'} p(a') p(b|a') ] · [ p(c|b) / Σ_{c'} p(c'|b) ]

a ⊥ c | b
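The factorization in the last step can be checked numerically; a small Octave sketch with a made-up chain a → b → c (none of these numbers are from the slides):

pa  = [0.3; 0.7];                  % p(a)
pba = [0.9 0.2; 0.1 0.8];          % pba(b, a) = p(b | a)
pcb = [0.6 0.4; 0.4 0.6];          % pcb(c, b) = p(c | b)
b0  = 1;                           % condition on b = b0
joint = (pa .* pba(b0, :)') * pcb(:, b0)';    % unnormalized p(a, c | b0)
joint = joint / sum(joint(:));
pa_b  = sum(joint, 2);             % p(a | b0)
pc_b  = sum(joint, 1);             % p(c | b0)
max(max(abs(joint - pa_b * pc_b))) % ~0: joint equals product of marginals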

slide-29
SLIDE 29

2 Common Cause

  • Joint distribution
  • a and c are dependent


  • Conditioning on b creates 


independence

p(a, b, c) = p(a|b) p(b) p(c|b)

p(a, c) = Σ_b p(a|b) p(b) p(c|b)

p(a, c | b) = p(a|b) p(c|b)     ⇒     a ⊥ c | b

slide-30
SLIDE 30

3 Explaining Away

  • Joint distribution
  • a and c are independent
  • Conditioning on b creates dependence

p(a, b, c) = p(a) p(b|a, c) p(c)

p(a, c | b) = p(a) p(b|a, c) p(c) / Σ_{a', c'} p(a') p(b|a', c') p(c')

slide-31
SLIDE 31

d-Separation

  • Given general directed acyclic graph (DAG)
  • Determine whether sets A, B of random variables

are conditionally independent given C

  • Simple algorithm - reachability
  • Start in a vertex of A
  • Check whether any vertex in B can be reached
  • If separated, we have conditional independence
slide-32
SLIDE 32

Transition rules

[diagrams: Bayes-ball transition rules over triples X, Y, Z and pairs X, Y, each shown with Y unobserved (a) and observed (b)]

Courtesy of Sam Roweis

slide-33
SLIDE 33

Transition rules

x2 ⊥ x3|{x1, x6} ?

[graph over X1, ..., X6]

ball can travel
  • opposite arrows
slide-37
SLIDE 37

Summary

  • Dependent random variables
  • Observing can make things dependent or

independent

  • Conditional independence simplifies model
  • Bayes ball to check properties
  • Chains (observing stops dependence)
  • Common causes (observing stops dependence)
  • Common children (observing creates dependence)
slide-38
SLIDE 38

Structures

slide-39
SLIDE 39

Plates: FOR loops for statisticians

  • Repeated dependency structure
  • Modeling iid observations



 
 
 


  • Supervised learning

[plate diagrams: x1 ... x4 written out; Θ → xi inside a plate; Θ → xi → yi ← w inside a plate]

p(X, θ) = p(θ) Π_i p(xi | θ)

p(X, Y, θ, w) = p(θ) p(w) Π_i p(xi | θ) p(yi | xi, w)


slide-41
SLIDE 41

Chains

Markov Chain

past past

present

future future

slide-42
SLIDE 42

Chains

Markov Chain

past past

present

future future

Plate

slide-43
SLIDE 43

Chains

Markov Chain

past past

present

future future

Plate Hidden Markov Chain

  • observed: user action; hidden: user's mindset

slide-44
SLIDE 44

Chains

Markov Chain

past past

present

future future

Plate Hidden Markov Chain

  • observed: user action; hidden: user's mindset

user model for traversal through search results


slide-46
SLIDE 46

Chains

Markov Chain Hidden Markov Chain

  • observed: user action; hidden: user's mindset

user model for traversal through search results

p(x, y; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ) · Π_{i=1}^{n} p(yi | xi)

p(x; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ)

Plate

slide-47
SLIDE 47

Factor Graphs

Latent Factors Observed Effects

slide-48
SLIDE 48

Factor Graphs

  • Observed effects


Click behavior, queries, watched news, emails

Latent Factors Observed Effects

slide-49
SLIDE 49

Factor Graphs

  • Observed effects


Click behavior, queries, watched news, emails

  • Latent factors


User profile, news content, hot keywords, social connectivity graph, events

Latent Factors Observed Effects

slide-50
SLIDE 50

Factor Graphs

  • Observed effects


Click behavior, queries, watched news, emails

  • Latent factors


User profile, news content, hot keywords, social connectivity graph, events

  • Multiple layers


Restricted Boltzmann Machine

Latent Factors Observed Effects

slide-51
SLIDE 51

Example - PCA/ICA

Latent Factors Observed Effects

x ∼ N( Σ_{i=1}^{d} yi vi, σ² 1 )    and    p(y) = Π_{i=1}^{d} p(yi)

slide-52
SLIDE 52

Example - PCA/ICA

  • Observed effects


Click behavior, queries, watched news, emails

Latent Factors Observed Effects

x ∼ N( Σ_{i=1}^{d} yi vi, σ² 1 )    and    p(y) = Π_{i=1}^{d} p(yi)


slide-55
SLIDE 55

Example - PCA/ICA

  • Observed effects


Click behavior, queries, watched news, emails

  • p(y) is Gaussian for PCA. General for ICA

Latent Factors Observed Effects

x ∼ N( Σ_{i=1}^{d} yi vi, σ² 1 )    and    p(y) = Π_{i=1}^{d} p(yi)

slide-56
SLIDE 56

Cocktail party problem

slide-57
SLIDE 57

Recommender Systems

r u m

slide-58
SLIDE 58

Recommender Systems

  • Users u
  • Movies m
  • Ratings r (but only for a subset of users)

r u m

slide-59
SLIDE 59

Recommender Systems

  • Users u
  • Movies m
  • Ratings r (but only for a subset of users)

r u m

... intersecting plates ... (like nested FOR loops)

slide-60
SLIDE 60

Recommender Systems

  • Users u
  • Movies m
  • Ratings r (but only for a subset of users)

r u m

... intersecting plates ... (like nested FOR loops)

news, SearchMonkey answers social ranking OMG personals

slide-61
SLIDE 61

Challenges

engineering

machine learning

slide-62
SLIDE 62

Challenges

  • How to design models
  • Common (engineering) sense
  • Computational tractability

engineering

machine learning

slide-63
SLIDE 63

Challenges

  • How to design models
  • Common (engineering) sense
  • Computational tractability
  • Dependency analysis

engineering

machine learning

slide-64
SLIDE 64

Challenges

  • How to design models
  • Common (engineering) sense
  • Computational tractability
  • Dependency analysis
  • Inference
  • Easy for fully observed situations
  • Many algorithms if not fully observed
  • Dynamic programming / message passing

engineering

machine learning

slide-65
SLIDE 65

Summary

  • Repeated structure - encode with plate
  • Chains, bipartite graphs, etc (more later)
  • Plates can intersect
  • Not all variables are observed

[plate diagram: Θ → xi, i = 1..m]

p(X, θ) = p(θ) Π_i p(xi | θ)

slide-67
SLIDE 67

Markov Chains

p(x; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ)

x0 → x1 → x2 → x3

Transition matrices:
  p(x0)      = [0.4, 0.6]
  p(x1 | x0) = [0.2 0.1; 0.8 0.9]
  p(x2 | x1) = [0.8 0.5; 0.2 0.5]
  p(x3 | x2) = [0 1; 1 0]

Unraveling the chain:
  p(x1) = Σ_{x0} p(x1 | x0) p(x0)   ⟺   π1 = Π_{0→1} π0
  p(x2) = Σ_{x1} p(x2 | x1) p(x1)   ⟺   π2 = Π_{1→2} π1 = Π_{1→2} Π_{0→1} π0

slide-68
SLIDE 68

Markov Chains

p(x; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ)

x0 → x1 → x2 → x3

  • From the start - sum sequentially

p(xi | x1) = Σ_{xj: 1<j<i}  Π_{l=2}^{i−1} p(xl+1 | xl) · p(x2 | x1)                      [=: l2(x2)]
           = Σ_{xj: 2<j<i}  Π_{l=3}^{i−1} p(xl+1 | xl) · Σ_{x2} p(x3 | x2) l2(x2)        [=: l3(x3)]
           = Σ_{xj: 3<j<i}  Π_{l=4}^{i−1} p(xl+1 | xl) · Σ_{x3} p(x4 | x3) l3(x3)        [=: l4(x4)]

slide-69
SLIDE 69

Markov Chains

p(x; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ)

x0 → x1 → x2 → x3

Unraveling the chain with the transition matrices listed above:

x0  = [0.4; 0.6];
Pi1 = [0.2 0.1; 0.8 0.9];
Pi2 = [0.8 0.5; 0.2 0.5];
Pi3 = [0 1; 1 0];
x3  = Pi3 * Pi2 * Pi1 * x0    % = [0.45800; 0.54200]

  • only need matrix-vector products
slide-70
SLIDE 70

Markov Chains

p(x; θ) = p(x0; θ) Π_{i=1}^{n−1} p(xi+1 | xi; θ)

x0 → x1 → x2 → x3

  • From the end - sum sequentially

p(x1 | xn) ∝ Σ_{xj: 1<j<n}    Π_{l=1}^{n−1} p(xl+1 | xl) · 1                                    [=: rn(xn)]
           = Σ_{xj: 1<j<n−1}  Π_{l=1}^{n−2} p(xl+1 | xl) · Σ_{xn} p(xn | xn−1) rn(xn)           [=: rn−1(xn−1)]
           = Σ_{xj: 1<j<n−2}  Π_{l=1}^{n−3} p(xl+1 | xl) · Σ_{xn−1} p(xn−1 | xn−2) rn−1(xn−1)   [=: rn−2(xn−2)]

normalize in the end

slide-71
SLIDE 71

Example - inferring lunch

  • Initial probability


p(x0=t)=p(x0=b) = 0.5

  • Stationary transition matrix
  • On fifth day observed at

Tazza d’oro p(x5=t)=1

  • Distribution on day 3
  • Left messages to 3
  • Right messages to 3
  • Renormalize

transition matrix (columns indexed by the current state):
  [0.9 0.2; 0.1 0.8]

slide-72
SLIDE 72

Example - inferring lunch

> Pi = [0.9 0.2; 0.1 0.8];
> l1 = [0.5; 0.5];
> l3 = Pi * Pi * l1
l3 =
   0.58500
   0.41500
> r5 = [1; 0];
> r3 = Pi' * Pi' * r5
r3 =
   0.83000
   0.34000
> (l3 .* r3) / sum(l3 .* r3)
ans =
   0.77483
   0.22517

slide-73
SLIDE 73

Message Passing

  • Send forward messages starting from left node


  • Send backward messages starting from right node

x0 → x1 → x2 → x3 → x4 → x5

m_{i+1→i}(xi) = Σ_{xi+1} m_{i+2→i+1}(xi+1) f(xi, xi+1)
m_{i−1→i}(xi) = Σ_{xi−1} m_{i−2→i−1}(xi−1) f(xi−1, xi)

In matrix form:   li = Πi li−1,    ri = Πiᵀ ri+1
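In matrix form both recursions are repeated matrix-vector products. A small Octave sketch reusing the transition matrices from the Markov-chain slides above (the cell-array layout and the evidence vector at the end are my choices, not from the slides):

Pis = { [0.2 0.1; 0.8 0.9], [0.8 0.5; 0.2 0.5], [0 1; 1 0] };  % Pis{i}(a,b) = p(x_i = a | x_{i-1} = b)
pi0 = [0.4; 0.6];                  % p(x_0)
n   = numel(Pis) + 1;
l   = cell(1, n);  r = cell(1, n);
l{1} = pi0;
r{n} = [1; 0];                     % evidence: last variable observed in state 1 (ones(2,1) if unobserved)
for i = 1:n-1
  l{i+1} = Pis{i} * l{i};          % forward:  l_i = Pi_i l_{i-1}
end
for i = n-1:-1:1
  r{i} = Pis{i}' * r{i+1};         % backward: r_i = Pi_i' r_{i+1}
end
marg3 = (l{3} .* r{3}) / sum(l{3} .* r{3})   % posterior at the third node (x2 in the slides' numbering)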

slide-74
SLIDE 74

Higher Order Markov Chains

  • First order chain
  • Second order

First order:    x0 → x1 → x2 → x3
p(X) = p(x0) Π_i p(xi+1 | xi)

Second order:   x0 → x1 → x2 → x3 (with additional edges from xi−1 to xi+1)
p(X) = p(x0, x1) Π_i p(xi+1 | xi, xi−1)


slide-76
SLIDE 76

Trees

  • Forward/Backward messages as normal for chain
  • When we have more edges for a vertex use ...

x0 x1 x2 x3 x4 x5

x6

x7 x8

slide-77
SLIDE 77

Trees

[tree: chain x0 → x1 → x2 → x3 → x4 → x5 with a branch x2 → x6 → x7 → x8]

l1(x1) = Σ_{x0} p(x0) p(x1 | x0)              r7(x7) = Σ_{x8} p(x8 | x7)
l2(x2) = Σ_{x1} l1(x1) p(x2 | x1)             r6(x6) = Σ_{x7} r7(x7) p(x7 | x6)
r2(x2) = Σ_{x6} r6(x6) p(x6 | x2)
l3(x3) = Σ_{x2} l2(x2) p(x3 | x2) r2(x2)      ...


slide-84
SLIDE 84

Junction Template

  • Order of computation
  • Dependence does not matter


(only matters for parametrization)

[diagram: vertex 2 with neighbors 1, 3, 4; messages flow in from 1 and 4 and out to 3]

m_{2→3}(x3) = Σ_{x2} m_{1→2}(x2) m_{4→2}(x2) f(x2, x3)

slide-85
SLIDE 85

Trees

  • Forward/Backward messages as normal for chain
  • When we have more edges for a vertex use ...

[tree over x0, ..., x8]

m_{2→3}(x3) = Σ_{x2} m_{1→2}(x2) m_{6→2}(x2) f(x2, x3)
m_{2→6}(x6) = Σ_{x2} m_{1→2}(x2) m_{3→2}(x2) f(x2, x6)
m_{2→1}(x1) = Σ_{x2} m_{3→2}(x2) m_{6→2}(x2) f(x1, x2)


slide-97
SLIDE 97

Summary

  • Markov chains
  • Present only depends on recent past
  • Higher order - longer history.
  • Dynamic programming
  • Exponential if brute force.
  • Linear in chain if we iterate.
  • For junctions treat like chains but


integrate signals from all sources.

  • Exponential in the history size.

slide-98
SLIDE 98

Hidden Markov Models

slide-99
SLIDE 99

Clustering and 
 Hidden Markov Models

  • Clustering - no dependence between observations
  • Hidden Markov Model - dependence between states

[diagrams: clustering (independent xi, observations yi) vs. hidden Markov chain x1 → x2 → ... → xm with observations y1, ..., ym; plate form over i = 1..m]

slide-100
SLIDE 100

Applications

  • Speech recognition (sound|text)
  • Optical character recognition (writing|text)
  • Gene finding (DNA sequence|genes)
  • Activity recognition (accelerometer|activity)


slide-101
SLIDE 101

Inference Tasks

  • Infer latent variables p(x|y), extend sequence
  • Estimate distributions p(yi|xi) and p(xi+1|xi)

p(x, y) = p(y1) [ Π_{i=1}^{m−1} p(yi+1 | yi) p(xi | yi) ] p(xm | ym)

slide-102
SLIDE 102

Recall Dynamic Programming

  • Observe y, maybe also part of x
  • Likely value for any xi in the chain means

summing over all other variables xj

p(x, y) = p(y1) [ Π_{i=1}^{m−1} p(yi+1 | yi) p(xi | yi) ] p(xm | ym)

slide-103
SLIDE 103

Dynamic Programming

  • Observe y, maybe also part of x
  • Likely value for any xi in the chain means

summing over all other variables xj

p(x | y) = p(x, y) / Σ_{x'} p(x', y)     and     p(xi | y) ∝ Σ_{xj: j<i} Σ_{xj: j>i} p(x, y)

slide-104
SLIDE 104

Dynamic Programming

l1(x1) = p(x1) p(y1 | x1)
lj+1(xj+1) = Σ_{xj} lj(xj) p(xj+1 | xj) p(yj | xj)

rn(xn) = 1
rj−1(xj−1) = Σ_{xj} rj(xj) p(yj | xj) p(xj | xj−1)

p(xi | rest) ∝ li(xi) ri(xi) p(yi | xi)

The same algorithm finds the most likely values: replace (+, ×) by (max, +).
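A minimal Octave sketch of these recursions (the transition matrix T, emission table E, prior p1 and observation sequence y are all made up; this version folds the emission p(yj|xj) into lj itself, so the posterior is just the normalized product l .* r, whereas the slide keeps one emission factor separate):

T  = [0.9 0.2; 0.1 0.8];              % T(a,b) = p(x_{j+1} = a | x_j = b)
E  = [0.7 0.1; 0.3 0.9];              % E(k,a) = p(y_j = k | x_j = a)
p1 = [0.5; 0.5];                      % p(x_1)
y  = [1 1 2 2 1];                     % observed sequence
n  = numel(y);
l  = zeros(2, n);  r = ones(2, n);
l(:,1) = p1 .* E(y(1), :)';                    % l_1(x_1) = p(x_1) p(y_1 | x_1)
for j = 1:n-1
  l(:,j+1) = (T * l(:,j)) .* E(y(j+1), :)';    % forward message
end
for j = n:-1:2
  r(:,j-1) = T' * (r(:,j) .* E(y(j), :)');     % backward message
end
post = (l .* r) ./ repmat(sum(l .* r, 1), 2, 1)   % p(x_j | y), each column sums to 1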

slide-105
SLIDE 105

Dynamic Programming

l1(x1) = log p(x1) + log p(y1 | x1)
lj+1(xj+1) = max_{xj} [ lj(xj) + log p(xj+1 | xj) + log p(yj | xj) ]

rn(xn) = 1
rj−1(xj−1) = max_{xj} [ rj(xj) + log p(yj | xj) + log p(xj | xj−1) ]

x̂i = argmax_{xi} [ li(xi) + ri(xi) + log p(yi | xi) ]
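The (max, +) version in Octave, on the same made-up model as the previous snippet (repeated here so the block is self-contained); it recovers the most likely state sequence by backtracking the argmax pointers:

T  = [0.9 0.2; 0.1 0.8];  E = [0.7 0.1; 0.3 0.9];
p1 = [0.5; 0.5];          y = [1 1 2 2 1];  n = numel(y);
logT = log(T);  logE = log(E);
V  = zeros(2, n);  bp = zeros(2, n);
V(:,1) = log(p1) + logE(y(1), :)';
for j = 2:n
  [V(:,j), bp(:,j)] = max(logT + repmat(V(:,j-1)', 2, 1), [], 2);  % max over x_{j-1}
  V(:,j) = V(:,j) + logE(y(j), :)';
end
xhat = zeros(1, n);
[~, xhat(n)] = max(V(:,n));
for j = n-1:-1:1
  xhat(j) = bp(xhat(j+1), j+1);       % backtrack
end
xhat                                   % most likely state sequence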

slide-106
SLIDE 106

Inference

  • What if we want to estimate the probabilities

directly? Log-likelihood is nonconvex!

p(x, y) = p(x1) [ Π_{i=1}^{m−1} p(xi+1 | xi) p(yi | xi) ] p(ym | xm)

slide-107
SLIDE 107

Variational Approximation

  • Lower bound on log-likelihood


  • Inequality holds for any q (equality for p(x|y)=q(x))
  • Find q within subset Q to tighten inequality
  • Find parameters to maximize for fixed q
  • Inference for graphical models where joint

probability computation is infeasible


log p(y; θ) ≥ ∫ dq(x) log p(x, y; θ) − ∫ dq(x) log q(x)

slide-108
SLIDE 108

Variational Approximation

  • Variational approximation via
    q(x) = q(x1) Π_{i=2}^{m} q(xi | xi−1)
  • Compute p(x | y) via dynamic programming

slide-109
SLIDE 109

Variational Principle in Action

  • Initialize parameters somehow
  • Set q(x) = p(x | y)
    Dynamic programming yields the chain q(x1), q(xi+1 | xi), q(xi)
  • Maximizing the log-likelihood w.r.t. q

log p(y; θ) ≥ ∫ dq(x) log p(x, y; θ) − ∫ dq(x) log q(x)

p(x, y) = p(x1) [ Π_{i=1}^{m−1} p(xi+1 | xi) p(yi | xi) ] p(ym | xm)

slide-110
SLIDE 110

Parameter Estimation

  • p(x1)


Since we have set p(x1) = q(x1)

  • p(yi|xi)


Same as clustering
 e.g. for Gaussians

E_{x∼q}[log p(x, y; θ)] = E_{x1∼q}[log p(x1; θ)] + Σ_{i=1}^{m} E_{xi∼q}[log p(yi | xi; θ)]
                          + Σ_{i=1}^{m−1} E_{(xi,xi+1)∼q}[log p(xi+1 | xi; θ)]

E_{q(x1)}[log p(x1)]

µx = (1/nx) Σ_{i=1}^{m} qi(x) yi          Σx = (1/nx) Σ_{i=1}^{m} qi(x) yi yiᵀ − µx µxᵀ

slide-111
SLIDE 111

Parameter Estimation

  • Maximum Likelihood estimate

E_{x∼q}[log p(x, y; θ)] = E_{x1∼q}[log p(x1; θ)] + Σ_{i=1}^{m} E_{xi∼q}[log p(yi | xi; θ)]
                          + Σ_{i=1}^{m−1} E_{(xi,xi+1)∼q}[log p(xi+1 | xi; θ)]

effective sample:   Σ_{i=1}^{m−1} q(a, b) log p(a | b)   hence   p(a | b) = Σ_{i=1}^{m−1} q(a, b) / Σ_{i=1}^{m−1} q(b)

slide-112
SLIDE 112

Smoothed Estimates

  • Laplace prior on latent state distribution
  • Uniform distribution over states
  • Alternatively assume that state remains



 
 
 
 
 


  • Same trick for means and variances

p(a | b) = [ n_{a|b} + Σ_{i=1}^{m−1} q(a, b) ] / [ n_b + Σ_{i=1}^{m−1} q(b) ]

(n_{a|b}, n_b: transition smoother;  the sums over q: aggregate mass)

slide-113
SLIDE 113

[diagram: chain x1 → ... → xm with observations y1, ..., ym, parameters Θ and per-state (µ1, Σ1), ..., (µk, Σk)]

Advanced Inference

slide-114
SLIDE 114

The Problem

  • Message passing leads to loops (not arrows)
  • Combine Variables


Junction Tree

  • Ignore it


Loopy Belief Propagation

  • Variational Approximation


Simpler distribution
 without loops

  • Gibbs Sampling


Draw from one variable at a time

slide-115
SLIDE 115

[plot: posterior density over parameter1, with two modes and the mean marked]

Is maximization (always) good?

p(θ|X) ∝ p(X|θ)p(θ)

slide-116
SLIDE 116

Sampling

  • Key idea
  • Want accurate distribution of the posterior
  • Sample from posterior distribution rather than

maximizing it

  • Problem - direct sampling is usually intractable
  • Solutions
  • Markov Chain Monte Carlo (complicated)
  • Gibbs Sampling (somewhat simpler)



 


x ∼ p(x | x′)  and then  x′ ∼ p(x′ | x)

slide-117
SLIDE 117

Gibbs sampling

  • Gibbs sampling:
  • In most cases direct sampling not possible
  • Draw one set of variables at a time

joint distribution: [0.45 0.05; 0.05 0.45]


slide-122
SLIDE 122

Gibbs sampling

  • Gibbs sampling:
  • In most cases direct sampling not possible
  • Draw one set of variables at a time

joint distribution: [0.45 0.05; 0.05 0.45]

(b,g) - draw p(·, g)
(g,g) - draw p(g, ·)
(g,g) - draw p(·, g)
(b,g) - draw p(b, ·)
(b,b) - ...
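A sketch of this sampler in Octave for the 2x2 joint above (states are coded 1/2 rather than the colors on the slide; the chain length is an arbitrary choice):

P = [0.45 0.05; 0.05 0.45];          % joint p(x1, x2)
n = 100000;  x = [1 1];  samples = zeros(n, 2);
for t = 1:n
  p1 = P(:, x(2)) / sum(P(:, x(2))); % conditional p(x1 | x2)
  x(1) = 1 + (rand() > p1(1));
  p2 = P(x(1), :) / sum(P(x(1), :)); % conditional p(x2 | x1)
  x(2) = 1 + (rand() > p2(1));
  samples(t, :) = x;
end
accumarray(samples, 1, [2 2]) / n    % empirical joint, approaches P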

slide-123
SLIDE 123

Gibbs sampling for clustering

slide-124
SLIDE 124

Gibbs sampling for clustering

random initialization

slide-125
SLIDE 125

Gibbs sampling for clustering

sample cluster labels

slide-126
SLIDE 126

Gibbs sampling for clustering

resample cluster model

slide-127
SLIDE 127

Gibbs sampling for clustering

resample cluster labels

slide-128
SLIDE 128

Gibbs sampling for clustering

resample cluster model

slide-129
SLIDE 129

Gibbs sampling for clustering

resample cluster labels

slide-130
SLIDE 130

Gibbs sampling for clustering

resample cluster model e.g. Mahout Dirichlet Process Clustering

slide-131
SLIDE 131

Blunting the arrows
 Undirected Graphical Models

slide-132
SLIDE 132

Chicken and Egg

slide-133
SLIDE 133

Chicken and Egg

p(c|e)p(e|c)


slide-135
SLIDE 135

Chicken and Egg

we know that chicken and egg are correlated either chicken or egg

slide-136
SLIDE 136

Chicken and Egg

p(c, e) ∝ exp ψ(c, e)

encode the correlation via the clique potential between c and e we know that chicken and egg are correlated either chicken or egg


slide-138
SLIDE 138

Chicken and Egg

we know that chicken and egg are correlated either chicken or egg

p(c, e) = exp ψ(c, e) / Σ_{c', e'} exp ψ(c', e')
        = exp[ ψ(c, e) − g(ψ) ]   where   g(ψ) = log Σ_{c, e} exp ψ(c, e)

slide-139
SLIDE 139

... some web service

MySQL Apache Website

p(w|m, a)p(m)p(a)

m ⊥ a | w does not hold

slide-140
SLIDE 140

... some web service

MySQL Apache Website

p(w|m, a)p(m)p(a)

m ⊥ a | w does not hold

MySQL Apache Website

Site affects MySQL Site affects Apache

p(m, w, a) ∝ φ(m, w)φ(w, a)

m ⊥ a | w

slide-141
SLIDE 141

... some web service

MySQL Apache Website

p(w|m, a)p(m)p(a)

m ⊥ a | w does not hold

MySQL Apache Website

Site affects MySQL Site affects Apache

p(m, w, a) ∝ φ(m, w)φ(w, a)

m ⊥ a | w

easier “debugging” easier “modeling”

slide-142
SLIDE 142

Undirected Graphical Models

Key Concept Observing nodes makes remainder conditionally independent


slide-150
SLIDE 150

Cliques

slide-151
SLIDE 151

Cliques

maximal fully connected subgraph

slide-152
SLIDE 152

Cliques

maximal fully connected subgraph

slide-153
SLIDE 153

Hammersley Clifford Theorem

p(x) = Π_c ψc(xc)

If density has full support then it decomposes into products of clique potentials

slide-154
SLIDE 154

Directed vs. Undirected

  • Causal description
  • Normalization automatic
  • Intuitive
  • Requires knowledge of

dependencies

  • Conditional independence

tricky (Bayes Ball algorithm)

  • Noncausal description

(correlation only)

  • Intuitive
  • Easy modeling
  • Normalization difficult
  • Conditional independence

easy to read off (graph connectivity)

slide-155
SLIDE 155

Exponential Families
 and Graphical Models

vs.

slide-156
SLIDE 156

Exponential Family Recap

  • Density function



 


  • Log partition function generates cumulants



 


  • g is convex (second derivative is p.s.d.)

p(x; θ) = exp( ⟨φ(x), θ⟩ − g(θ) )   where   g(θ) = log Σ_{x'} exp ⟨φ(x'), θ⟩

∂θ g(θ) = E[φ(x)]          ∂θ² g(θ) = Var[φ(x)]

slide-157
SLIDE 157

Log Partition Function

Unconditional model:
  p(x | θ) = e^{⟨φ(x), θ⟩ − g(θ)}         g(θ) = log Σ_x e^{⟨φ(x), θ⟩}
  ∂θ g(θ) = Σ_x φ(x) e^{⟨φ(x), θ⟩} / Σ_x e^{⟨φ(x), θ⟩} = Σ_x φ(x) e^{⟨φ(x), θ⟩ − g(θ)}

Conditional model:
  p(y | θ, x) = e^{⟨φ(x,y), θ⟩ − g(θ|x)}   g(θ | x) = log Σ_y e^{⟨φ(x,y), θ⟩}
  ∂θ g(θ | x) = Σ_y φ(x, y) e^{⟨φ(x,y), θ⟩} / Σ_y e^{⟨φ(x,y), θ⟩} = Σ_y φ(x, y) e^{⟨φ(x,y), θ⟩ − g(θ|x)}

slide-158
SLIDE 158

Estimation

  • Conditional log-likelihood
  • Log-posterior (Gaussian Prior)



 
 


  • First order optimality conditions

log p(y | x; θ) = ⟨φ(x, y), θ⟩ − g(θ | x)

log p(θ | X, Y) = Σ_i log p(yi | xi; θ) + log p(θ) + const.
                = ⟨ Σ_i φ(xi, yi), θ ⟩ − Σ_i g(θ | xi) − (1/2σ²) ‖θ‖² + const.

First order optimality:
  Σ_i φ(xi, yi) = Σ_i E_{y|xi}[φ(xi, y)] + (1/σ²) θ
  (expectation term: maxent model, expensive to compute; last term: prior)

slide-159
SLIDE 159

Logistic Regression

  • Label space
  • Log-partition function
  • Convex minimization problem


  • Prediction

φ(x, y) = y φ(x) where y ∈ {±1}

g(θ | x) = log[ e^{⟨φ(x),θ⟩} + e^{−⟨φ(x),θ⟩} ] = log 2 cosh ⟨φ(x), θ⟩

minimize_θ  (1/2σ²) ‖θ‖² + Σ_i [ log 2 cosh ⟨φ(xi), θ⟩ − yi ⟨φ(xi), θ⟩ ]

p(y | x, θ) = e^{y⟨φ(x),θ⟩} / ( e^{⟨φ(x),θ⟩} + e^{−⟨φ(x),θ⟩} ) = 1 / ( 1 + e^{−2y⟨φ(x),θ⟩} )
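A sketch of solving this convex problem by plain gradient descent in Octave (X and y are synthetic; sigma, the step size and the iteration count are arbitrary choices, not from the slides). The gradient of log 2 cosh⟨φ(x), θ⟩ is tanh(⟨φ(x), θ⟩) φ(x), which gives:

n = 200;  d = 5;  sigma = 1;  eta = 0.5;
X = randn(n, d);                                 % rows are phi(x_i)
y = sign(X * randn(d, 1) + 0.3 * randn(n, 1));   % labels in {-1, +1}
theta = zeros(d, 1);
for t = 1:1000
  z     = X * theta;
  grad  = theta / sigma^2 + X' * (tanh(z) - y);  % gradient of the objective above
  theta = theta - eta * grad / n;
end
p_pos     = 1 ./ (1 + exp(-2 * X * theta));      % p(y = +1 | x, theta)
train_acc = mean(sign(X * theta) == y)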

slide-160
SLIDE 160

Logistic Regression


GP Classification

slide-161
SLIDE 161

Exponential Clique Decomposition

p(x) = Π_c ψc(xc)

Theorem: Clique decomposition holds in sufficient statistics

φ(x) = (…, φc(xc), …)   and   ⟨φ(x), θ⟩ = Σ_c ⟨φc(xc), θc⟩

Corollary: we only need expectations on cliques

Ex[φ(x)] = (…, E_{xc}[φc(xc)], …)

slide-162
SLIDE 162

Conditional Random Fields

φ(x, y) = ( y1 φx(x1), …, yn φx(xn), φy(y1, y2), …, φy(yn−1, yn) )

⟨φ(x, y), θ⟩ = Σ_i ⟨φx(xi, yi), θx⟩ + Σ_i ⟨φy(yi, yi+1), θy⟩

g(θ | x) = log Σ_y Π_i fi(yi, yi+1)   where   fi(yi, yi+1) = e^{⟨φx(xi, yi), θx⟩ + ⟨φy(yi, yi+1), θy⟩}

dynamic programming

slide-163
SLIDE 163

Conditional Random Fields

  • Compute distribution over marginal and adjacent labels
  • Take conditional expectations
  • Take update step (batch or online)
  • More general techniques for computing normalization

via message passing ...

slide-164
SLIDE 164

Examples

slide-165
SLIDE 165

p(x) = Π_i ψi(xi, xi+1)

Chains


slide-167
SLIDE 167

p(x) = Π_i ψi(xi, xi+1)

Chains

p(x, y) = Π_i ψx_i(xi, xi+1) ψxy_i(xi, yi)

slide-168
SLIDE 168

p(x) = Π_i ψi(xi, xi+1)

Chains

p(x, y) = Π_i ψx_i(xi, xi+1) ψxy_i(xi, yi)

p(x | y) ∝ Π_i ψx_i(xi, xi+1) ψxy_i(xi, yi)  =:  Π_i fi(xi, xi+1)

slide-169
SLIDE 169

Chains

p(x | y) ∝ Π_i ψx_i(xi, xi+1) ψxy_i(xi, yi)  =:  Π_i fi(xi, xi+1)

Dynamic Programming

l1(x1) = 1   and   li+1(xi+1) = Σ_{xi} li(xi) fi(xi, xi+1)
rn(xn) = 1   and   ri(xi) = Σ_{xi+1} ri+1(xi+1) fi(xi, xi+1)

slide-170
SLIDE 170

Named Entity Tagging

p(x | y) ∝ Π_i ψx_i(xi, xi+1) ψxy_i(xi, yi)  =:  Π_i fi(xi, xi+1)

slide-171
SLIDE 171

Trees + Ontologies

  • Ontology classification (e.g. YDir, DMOZ)

[diagram: document x connected to a tree of labels y]

p(y | x) = Π_i ψ(yi, y_parent(i), x)

slide-172
SLIDE 172

Spin Glasses + Images

  • observed pixels y, real image x

p(x | y) = Π_{ij} ψright(xij, xi+1,j) ψup(xij, xi,j+1) ψxy(xij, yij)

slide-173
SLIDE 173

Spin Glasses + Images

  • observed pixels y, real image x

p(x | y) = Π_{ij} ψright(xij, xi+1,j) ψup(xij, xi,j+1) ψxy(xij, yij)

long range interactions

slide-174
SLIDE 174

Image Denoising

Li&Huttenlocher, ECCV’08

slide-175
SLIDE 175

Semi-Markov Models

  • Flexible length of an episode
  • Segmentation between episodes

classification CRF SMM

phrase segmentation, activity recognition, motion data analysis Shi, Smola, Altun, Vishwanathan, Li, 2007-2009

slide-176
SLIDE 176

2D CRF for Webpages

web page information extraction, segmentation, annotation Bo, Zhu, Nie, Wen, Hon, 2005-2007