

SLIDE 1

Probabilistic Graphical Models

Lecture 11 – CRFs, Exponential Family

CS/CNS/EE 155 Andreas Krause

SLIDE 2

Announcements

Homework 2 due today. Project milestones due next Monday (Nov 9).

About half the work should be done. 4 pages of write-up in NIPS format: http://nips.cc/PaperInformation/StyleFiles

SLIDE 3

So far

Markov Network Representation

Local/global Markov assumptions; separation. Soundness and completeness of separation.

Markov Network Inference

Variable elimination and Junction Tree inference work exactly as in Bayes Nets

How about Learning Markov Nets?

SLIDE 4

MLE for Markov Nets

Log likelihood of the data
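For a Markov network with clique potentials φc and m i.i.d. samples, the log-likelihood has the standard form

$$\ell(\mathcal{D} \mid \theta) \;=\; \sum_{j=1}^{m} \sum_{c} \log \phi_c\big(x^{(j)}_c\big) \;-\; m \log Z(\theta)$$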

SLIDE 5

Log-likelihood doesn’t decompose

The log-likelihood ℓ(D | θ) is a concave function! The log-partition function log Z(θ) doesn’t decompose.
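Concretely, the partition function sums over all joint assignments, which couples the parameters of all cliques:

$$Z(\theta) \;=\; \sum_{x} \prod_{c} \phi_c(x_c)$$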

SLIDE 6

Computing the derivative

Derivative: computing P(ci | θ) requires inference! Can optimize using conjugate gradient, etc.

[Figure: example Markov network over variables C, D, I, G, S, L, J, H]
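With the log-potentials as parameters, the derivative contrasts empirical and model clique marginals, which is why evaluating it requires inference:

$$\frac{\partial \ell}{\partial \log \phi_c(x_c)} \;=\; m \big(\hat{P}(x_c) - P(x_c \mid \theta)\big)$$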

SLIDE 7

Alternative approach: Iterative Proportional Fitting (IPF)

At the optimum, it must hold that the model’s clique marginals match the empirical marginals. Solve this as a fixed-point equation; must recompute the model marginals (inference) in every iteration.
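The standard IPF update rescales each clique potential by the ratio of empirical to current model marginals:

$$\phi_c^{(t+1)}(x_c) \;=\; \phi_c^{(t)}(x_c)\, \frac{\hat{P}(x_c)}{P^{(t)}(x_c)}$$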

SLIDE 8

Parameter learning for log-linear models

Feature functions φi(Ci) defined over cliques; log-linear model over undirected graph G

Feature functions φ1(C1),…,φk(Ck); the domains Ci can overlap

Joint distribution given below. How do we get the weights wi?
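$$P(x \mid w) \;=\; \frac{1}{Z(w)} \exp\Big(\sum_{i=1}^{k} w_i\, \phi_i(c_i)\Big), \qquad Z(w) \;=\; \sum_{x} \exp\Big(\sum_{i=1}^{k} w_i\, \phi_i(c_i)\Big)$$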

SLIDE 9

Optimizing parameters

Gradient of the log-likelihood: setting it to zero shows that w is the MLE exactly when the model’s expected feature counts match the empirical feature counts (see the sketch below).
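As a concrete illustration of this moment-matching gradient, here is a minimal Python sketch (not from the lecture; the toy model, features, and data are illustrative assumptions) that fits a small log-linear Markov network by gradient ascent, computing the model’s expected features by brute-force enumeration:

```python
import itertools
import numpy as np

# Toy log-linear MN over 3 binary variables with two pairwise features.
# Features (illustrative): indicators of agreement on edges (0,1) and (1,2).
def features(x):
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

STATES = list(itertools.product([0, 1], repeat=3))  # all 8 joint assignments

def expected_features(w):
    scores = np.array([w @ features(x) for x in STATES])
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # model distribution P(x | w)
    return p @ np.array([features(x) for x in STATES])

# Synthetic data (illustrative): samples where neighbors tend to agree.
data = [(0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 1, 0)]
emp = np.mean([features(x) for x in data], axis=0)  # empirical feature means

w = np.zeros(2)
for _ in range(500):
    grad = emp - expected_features(w)  # moment-matching gradient
    w += 0.5 * grad                    # ascent on average log-likelihood
print("learned weights:", w)
print("moment match:", expected_features(w), "vs", emp)
```

Brute-force enumeration stands in for the inference step; on real models it would be replaced by variable elimination or junction-tree inference.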

SLIDE 10

Regularization of parameters

Put prior on parameters w
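A common choice (one standard option, not the only one) is an isotropic Gaussian prior, which turns MAP estimation into L2-regularized likelihood maximization:

$$P(w) \propto \exp\Big(\!-\frac{\lVert w \rVert_2^2}{2\sigma^2}\Big) \;\Rightarrow\; \hat{w}_{\mathrm{MAP}} = \arg\max_w\; \ell(\mathcal{D} \mid w) - \frac{\lVert w \rVert_2^2}{2\sigma^2}$$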

SLIDE 11

Summary: Parameter learning in MN

MLE in BN is easy (score decomposes). MLE in MN requires inference (score doesn’t decompose). Can optimize using gradient ascent or IPF.

SLIDE 12

Generative vs. discriminative models

Often, want to predict Y for inputs X

Bayes optimal classifier: Predict according to P(Y | X)

Generative model

Model P(Y) and P(X | Y); use Bayes’ rule to compute P(Y | X)

Discriminative model

Model P(Y | X) directly! Don’t model the distribution P(X) over inputs X, so cannot “generate” sample inputs. Example: logistic regression

SLIDE 13

Example: Logistic Regression
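For reference, logistic regression models the conditional distribution directly:

$$P(Y = 1 \mid x, w) \;=\; \frac{1}{1 + \exp(-w^\top x)}$$

i.e., a log-linear model with a single binary output variable.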

SLIDE 14

Log-linear conditional random field

Define log-linear model over outputs Y

No assumptions about inputs X

Feature functions φi(Ci, x) defined over cliques and inputs; these define the conditional distribution given below.
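$$P(y \mid x, w) \;=\; \frac{1}{Z(x, w)} \exp\Big(\sum_i w_i\, \phi_i(c_i, x)\Big), \qquad Z(x, w) \;=\; \sum_{y'} \exp\Big(\sum_i w_i\, \phi_i(c_i', x)\Big)$$

Note that the partition function now depends on the input x.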

SLIDE 15

Example: CRFs in NLP

Classify into Person, Location or Other

  • Mrs. Greene spoke today in New York. Green chairs the finance committee

[Figure: linear-chain CRF with word inputs X1,…,X12 and label outputs Y1,…,Y12]

SLIDE 16

Example: CRFs in vision

SLIDE 17

Parameter learning for log-linear CRF

Conditional log-likelihood of the data (below); can maximize using conjugate gradient
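That is, with training pairs (x⁽ʲ⁾, y⁽ʲ⁾):

$$\ell(w : \mathcal{D}) \;=\; \sum_{j=1}^{m} \log P\big(y^{(j)} \mid x^{(j)}, w\big)$$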

SLIDE 18

Gradient of conditional log-likelihood

Partial derivative requires one inference per training example. Can optimize using conjugate gradient.
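Written out, the partial derivative is

$$\frac{\partial \ell}{\partial w_i} \;=\; \sum_{j=1}^{m} \Big[ \phi_i\big(y^{(j)}, x^{(j)}\big) \;-\; \mathbb{E}_{Y \sim P(\cdot \mid x^{(j)}, w)}\big[\phi_i(Y, x^{(j)})\big] \Big]$$

Each expectation is taken under a different conditional distribution, hence one inference per training example.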

SLIDE 19

Exponential Family Distributions

Distributions for log-linear models. More generally: exponential family distributions.

h(x): base measure; w: natural parameters; φ(x): sufficient statistics; A(w): log-partition function

Here x can be continuous (defined over any set)
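In formula form:

$$P(x \mid w) \;=\; h(x)\, \exp\big(w^\top \phi(x) - A(w)\big), \qquad A(w) \;=\; \log \int h(x)\, e^{w^\top \phi(x)}\, dx$$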

SLIDE 20

Examples

Exp. family example: the Gaussian distribution

Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …

h(x): base measure; w: natural parameters; φ(x): sufficient statistics; A(w): log-partition function
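For example, the univariate Gaussian fits this template with

$$h(x) = \tfrac{1}{\sqrt{2\pi}}, \quad \phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad w = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \quad A(w) = \frac{\mu^2}{2\sigma^2} + \log \sigma$$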

SLIDE 21

Moments and gradients

Correspondence between moments and the log-partition function (just like in log-linear models). Can compute moments from derivatives, and derivatives from moments! MLE ⟺ moment matching.
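Concretely:

$$\nabla_w A(w) = \mathbb{E}_w[\phi(X)], \qquad \nabla_w^2 A(w) = \mathrm{Cov}_w[\phi(X)]$$

and the MLE condition is exactly moment matching: $\mathbb{E}_{\hat{w}}[\phi(X)] = \frac{1}{m}\sum_j \phi\big(x^{(j)}\big)$.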

SLIDE 22

Recall: Conjugate priors

Consider parametric families of prior distributions:

P(θ) = f(θ; α); α is called the “hyperparameters” of the prior

A prior P(θ) = f(θ; α) is called conjugate for a likelihood function P(D | θ) if P(θ | D) = f(θ; α′)

Posterior has the same parametric form; hyperparameters are updated based on the data D

Obvious questions (answered later):

How to choose hyperparameters?? Why limit ourselves to conjugate priors??

SLIDE 23

Conjugate priors in Exponential Family

Any exponential family likelihood has a conjugate prior
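For a likelihood P(x | w) = h(x) exp(wᵀφ(x) − A(w)), the conjugate family has the standard form

$$P(w \mid \alpha, \nu) \;\propto\; \exp\big(w^\top \alpha - \nu\, A(w)\big)$$

Observing data then simply adds the sufficient statistics to α and the sample count to ν.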

SLIDE 24

Maximum Entropy interpretation

Theorem: Exponential family distributions maximize the entropy over all distributions satisfying the given constraints on the expected sufficient statistics.
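Spelled out, the optimization whose solution is the exponential family is

$$\max_{P}\; H(P) \quad \text{s.t.} \quad \mathbb{E}_P[\phi(X)] = \hat{\mu} \;\;\Longrightarrow\;\; P(x) \propto h(x)\, e^{w^\top \phi(x)}$$

where the natural parameters w arise as Lagrange multipliers of the moment constraints.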

SLIDE 25

Summary exponential family

Distributions of the form P(x | w) = h(x) exp(wᵀφ(x) − A(w)). Most common distributions are in the exponential family.

Multinomial, Gaussian, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …; log-linear Markov networks

All exponential family distributions have a conjugate prior in the exponential family. Moments of the sufficient statistics = derivatives of the log-partition function. Maximum entropy distributions (“most uncertain” distributions with specified expected sufficient statistics).

SLIDE 26

Exponential family graphical models

So far, we only defined graphical models over discrete variables. Can define GMs over continuous distributions! For exponential family distributions, can do much of what we discussed (VE, JT inference, parameter learning, etc.). Important example: Gaussian networks.

SLIDE 27

Gaussian distribution

σ = standard deviation, µ = mean
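The density is

$$P(x) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(\!-\frac{(x - \mu)^2}{2\sigma^2}\Big)$$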

SLIDE 28

Bivariate Gaussian distribution

[Figure: surface and contour plots of two bivariate Gaussian densities with different covariances]

SLIDE 29

Multivariate Gaussian distribution

Joint distribution over n random variables P(X1,…,Xn), with σjk = E[(Xj − µj)(Xk − µk)]. Xj and Xk independent ⟺ σjk = 0.
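The density is

$$P(x) \;=\; \frac{1}{(2\pi)^{n/2} \lvert \Sigma \rvert^{1/2}} \exp\Big(\!-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)$$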

SLIDE 30

Marginalization

Suppose (X1,…,Xn) ~ N(µ, Σ). What is P(X1)?? More generally: let A = {i1,…,ik} ⊆ {1,…,n} and write XA = (Xi1,…,Xik). Then XA ~ N(µA, ΣAA).

SLIDE 31

Conditioning

Suppose (X1,…,Xn) ~ N(µ, Σ), decomposed as (XA, XB). What is P(XA | XB)?? P(XA = xA | XB = xB) = N(xA; µA|B, ΣA|B), where the conditional parameters are given below. Computable using linear algebra!
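The conditional parameters are the standard Gaussian conditioning formulas:

$$\mu_{A \mid B} \;=\; \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B), \qquad \Sigma_{A \mid B} \;=\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}$$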

SLIDE 32

Conditioning

[Figure: bivariate Gaussian density, with the slice at X1 = 0.75 giving the conditional P(X2 | X1 = 0.75)]

SLIDE 33

Conditional linear Gaussians
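For reference, a conditional linear Gaussian models a continuous variable Y given continuous parents x as

$$P(Y = y \mid x) \;=\; \mathcal{N}\big(y;\; \beta_0 + \beta^\top x,\; \sigma^2\big)$$

i.e., the mean depends linearly on the parents and the variance is fixed.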

SLIDE 34

Canonical representation of Gaussians

SLIDE 35

Canonical Representation

Multivariate Gaussians in exponential family! Standard vs canonical form:

Λ = Σ⁻¹ (precision matrix), η = Σ⁻¹µ (potential vector)

In standard form, marginalization is easy. Will see: in canonical form, multiplication/conditioning is easy!
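In canonical (information) form the density reads

$$P(x) \;\propto\; \exp\Big(\eta^\top x - \frac{1}{2} x^\top \Lambda x\Big)$$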

SLIDE 36

Gaussian Networks

Zeros in precision matrix indicate missing edges in log-linear model!
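Equivalently, in terms of conditional independence:

$$X_i \perp X_j \mid X_{\mathrm{rest}} \iff \Lambda_{ij} = 0$$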

SLIDE 37

Inference in Gaussian Networks

Can compute marginal distributions in O(n³)! For large numbers n of variables, still intractable. If the Gaussian network has low treewidth, can use variable elimination / JT inference! Need to be able to multiply and marginalize factors (see the sketch below)!
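As a small illustration (a minimal numpy sketch, not from the lecture; the helper names marginal and condition are hypothetical), here is exact marginalization and conditioning of a multivariate Gaussian using the formulas from the earlier slides:

```python
import numpy as np

def marginal(mu, Sigma, A):
    """Marginal of the sub-vector indexed by A: just pick out blocks."""
    A = np.asarray(A)
    return mu[A], Sigma[np.ix_(A, A)]

def condition(mu, Sigma, A, B, x_B):
    """P(X_A | X_B = x_B) via the standard Gaussian conditioning formulas."""
    A, B = np.asarray(A), np.asarray(B)
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    K = S_AB @ np.linalg.inv(S_BB)          # regression coefficients
    mu_cond = mu[A] + K @ (x_B - mu[B])     # shifted mean
    Sigma_cond = S_AA - K @ S_AB.T          # Schur complement
    return mu_cond, Sigma_cond

# Illustrative 3-variable Gaussian.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
print(marginal(mu, Sigma, [0]))                          # P(X1)
print(condition(mu, Sigma, [1], [0], np.array([0.75])))  # P(X2 | X1=0.75)
```

The O(n³) cost comes from the matrix inversion/solve over the conditioned block.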

SLIDE 38

Multiplying factors in Gaussians
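In canonical form, multiplying two Gaussian factors simply adds their parameters (after extending each factor’s η and Λ with zeros to a common variable scope):

$$(\eta_1, \Lambda_1) \cdot (\eta_2, \Lambda_2) \;=\; (\eta_1 + \eta_2,\; \Lambda_1 + \Lambda_2)$$

This is exactly why the canonical form is convenient for variable elimination.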

SLIDE 39

Marginalizing in canonical form

Recall conversion formulas

Λ = Σ⁻¹, η = Σ⁻¹µ

Marginal distribution
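Marginalizing out the block B in canonical form uses a Schur complement:

$$\Lambda' \;=\; \Lambda_{AA} - \Lambda_{AB} \Lambda_{BB}^{-1} \Lambda_{BA}, \qquad \eta' \;=\; \eta_A - \Lambda_{AB} \Lambda_{BB}^{-1} \eta_B$$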

SLIDE 40

Variable elimination

In Gaussian Markov networks, variable elimination = Gaussian elimination (fast for low-bandwidth = low-treewidth matrices)

SLIDE 41

Tasks

Read Koller & Friedman Chapters 4.6.1, 8.1-8.3