SLIDE 1

Learning Bayesian networks:

Given structure and completely observed data

Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

SLIDE 2

Learning problem

β€’ Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = (\mathcal{G}^*, \boldsymbol{\theta}^*)$
β€’ Hypothesis space: a specified family of probabilistic graphical models
β€’ Data: a set of instances sampled from $P^*$
β€’ Learning goal: selecting a model $\mathcal{M}$ that constructs the best approximation to $\mathcal{M}^*$ according to a performance metric

SLIDE 3

Learning tasks on graphical models

β€’ Parameter learning / structure learning
β€’ Completely observed / partially observed data
β€’ Directed models / undirected models

SLIDE 4

Parameter learning in directed models: Complete data

β€’ We assume that the structure of the model is known
  β€’ We consider learning the parameters of a BN with a given structure
β€’ Goal: estimate the CPDs from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
β€’ Each training sample $\boldsymbol{x}^{(n)} = \left(x_1^{(n)}, \dots, x_L^{(n)}\right)$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables)

SLIDE 5

Density estimation review

β€’ We use density estimation to solve this learning problem
β€’ Density estimation: estimating the probability density function $P(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$ drawn from it
β€’ Parametric methods: assume $P(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters
  β€’ MLE and Bayesian estimation
  β€’ MLE: need to estimate $\boldsymbol{\theta}^*$ given $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$
    β€’ MLE suffers from overfitting
  β€’ Bayesian estimation: a probability distribution $P(\boldsymbol{\theta})$ over the spectrum of hypotheses
    β€’ Needs a prior distribution on the parameters

SLIDE 6

Density estimation: Graphical model

β€’ i.i.d. assumption: as a plate model, the single parameter node $\boldsymbol{\theta}$ is the shared parent of every sample node $X^{(n)}$, $n = 1, \dots, N$ (equivalently, of the unrolled nodes $X^{(1)}, X^{(2)}, \dots, X^{(N)}$)
β€’ In the Bayesian view, $\boldsymbol{\theta}$ itself has a parent node of hyperparameters $\boldsymbol{\alpha}$

SLIDE 7

Maximum Likelihood Estimation (MLE)

β€’ The likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$
β€’ Assuming i.i.d. (independent, identically distributed) samples, the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples is

$P(\mathcal{D} \mid \boldsymbol{\theta}) = P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta})$

β€’ Maximum likelihood estimation:

$\hat{\boldsymbol{\theta}}_{ML} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\mathcal{D} \mid \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{n=1}^{N} \ln P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta})$

β€’ MLE has a closed-form solution for many parametric distributions

SLIDE 8

MLE: Bernoulli distribution

β€’ Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), and $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$ (so $P(x = 1 \mid \theta) = \theta$):

$P(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} P(x^{(n)} \mid \theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$

$\ln P(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \ln P(x^{(n)} \mid \theta) = \sum_{n=1}^{N} \left\{ x^{(n)} \ln\theta + (1 - x^{(n)}) \ln(1-\theta) \right\}$

$\dfrac{\partial \ln P(\mathcal{D} \mid \theta)}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \dfrac{\sum_{n=1}^{N} x^{(n)}}{N} = \dfrac{m}{N}$

SLIDE 9

MLE: Multinomial distribution

β€’ Multinomial distribution (on a variable with $K$ states):

$P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$

β€’ Parameter space: $\boldsymbol{\theta} = [\theta_1, \dots, \theta_K]$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$; e.g., for $K = 3$, the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ with $\theta_k \in [0,1]$ gives a simplex containing the set of valid parameters
β€’ Variable: 1-of-K coding, $\boldsymbol{x} = [x_1, \dots, x_K]$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$, so that $P(x_k = 1) = \theta_k$

SLIDE 10

MLE: Multinomial distribution

β€’ Given $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:

$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{\sum_{n=1}^{N} x_k^{(n)}}$

β€’ Maximizing the log-likelihood subject to $\sum_k \theta_k = 1$ via a Lagrange multiplier:

$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln P(\mathcal{D} \mid \boldsymbol{\theta}) + \lambda \Big(1 - \sum_{k=1}^{K} \theta_k\Big) \;\Rightarrow\; \hat{\theta}_k = \dfrac{\sum_{n=1}^{N} x_k^{(n)}}{N} = \dfrac{m_k}{N}$

where $m_k = \sum_{n=1}^{N} x_k^{(n)}$ and $\sum_{k=1}^{K} m_k = N$
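The closed form above is just normalized counts; a minimal sketch (illustrative, not from the slides), with integer state labels as the assumed input encoding:

```python
import numpy as np

def multinomial_mle(labels, K):
    """ML estimate theta_k = m_k / N from state labels in {0, ..., K-1}."""
    counts = np.bincount(labels, minlength=K)   # m_k
    return counts / counts.sum()                # m_k / N

print(multinomial_mle([0, 2, 2, 1, 2, 0], K=3))   # [1/3, 1/6, 1/2]
```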

SLIDE 11

MLE: Gaussian with unknown $\mu$

$\ln P(x^{(n)} \mid \mu) = -\dfrac{1}{2}\ln(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\big(x^{(n)} - \mu\big)^2$

$\dfrac{\partial \ln P(\mathcal{D} \mid \mu)}{\partial \mu} = 0 \;\Rightarrow\; \dfrac{\partial}{\partial \mu} \sum_{n=1}^{N} \ln P(x^{(n)} \mid \mu) = 0 \;\Rightarrow\; \sum_{n=1}^{N} \dfrac{1}{\sigma^2}\big(x^{(n)} - \mu\big) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{n=1}^{N} x^{(n)}$

SLIDE 12

Bayesian approach

β€’ Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori distribution
  β€’ This utilizes the available prior information about the unknown parameters
  β€’ As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$
β€’ Samples $\mathcal{D}$ convert the prior density $P(\boldsymbol{\theta})$ into a posterior density $P(\boldsymbol{\theta} \mid \mathcal{D})$
β€’ We keep track of beliefs about $\boldsymbol{\theta}$'s values and use these beliefs to reach conclusions

SLIDE 13

Maximum A Posteriori (MAP) estimation

β€’ MAP estimation:

$\hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\boldsymbol{\theta} \mid \mathcal{D})$

β€’ Since $P(\boldsymbol{\theta} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$:

$\hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$

β€’ Example of a prior distribution: $P(\theta) = \mathcal{N}(\theta_0, \sigma^2)$

SLIDE 14

Bayesian approach: Predictive distribution

β€’ Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$, a prior distribution on the parameters $P(\boldsymbol{\theta})$, and the form of the distribution $P(\boldsymbol{x} \mid \boldsymbol{\theta})$
β€’ We find $P(\boldsymbol{\theta} \mid \mathcal{D})$ and use it to specify the predictive distribution $P(\boldsymbol{x} \mid \mathcal{D})$ on new data as an estimate of $P(\boldsymbol{x})$:

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x}, \boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \mathcal{D}, \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}$

β€’ Analytical solutions exist only for very special forms of the involved functions

SLIDE 15

Conjugate priors

β€’ We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
β€’ Choose a prior such that the posterior distribution, which is proportional to $P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$, has the same functional form as the prior:

$\forall \boldsymbol{\alpha}, \mathcal{D}\ \ \exists \boldsymbol{\alpha}'\ :\ P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}') \propto P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$

where $P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}')$ and $P(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$ have the same functional form

SLIDE 16

Prior for Bernoulli likelihood

β€’ Beta distribution over $\theta \in [0,1]$:

$\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) = \dfrac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \;\propto\; \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$

β€’ The Beta distribution is the conjugate prior of the Bernoulli likelihood $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$
β€’ Mean and mode:

$E[\theta] = \dfrac{\alpha_1}{\alpha_0 + \alpha_1}, \qquad \hat{\theta} = \dfrac{\alpha_1 - 1}{\alpha_0 - 1 + \alpha_1 - 1}\ \ (\text{most probable } \theta)$

SLIDE 17

Beta distribution

SLIDE 18

Bernoulli likelihood: Posterior

β€’ Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m = \sum_{n=1}^{N} x^{(n)}$ heads (1) and $N - m$ tails (0):

$P(\theta \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid \theta)\, P(\theta) = \left[\prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}\right] \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \;\propto\; \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1}$

$\Rightarrow\; P(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1', \alpha_0') \;\propto\; \theta^{\alpha_1' - 1}(1-\theta)^{\alpha_0' - 1}, \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$

SLIDE 19

Example

β€’ Bernoulli likelihood $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$, prior $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0)$ with $\alpha_0 = \alpha_1 = 2$
β€’ Data $\mathcal{D} = \{1, 1, 1\} \Rightarrow N = 3,\ m = 3$, so the posterior is $\mathrm{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = 5,\ \alpha_0' = 2$

$\hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\, P(\theta \mid \mathcal{D}) = \dfrac{\alpha_1' - 1}{\alpha_1' - 1 + \alpha_0' - 1} = \dfrac{4}{5}$

SLIDE 20

Bernoulli: Predictive distribution

β€’ Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ with $m$ ones

$P(\theta) = \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}, \qquad P(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1}(1-\theta)^{\alpha_0 + (N-m) - 1}$

$P(x \mid \mathcal{D}) = \int P(x \mid \theta)\, P(\theta \mid \mathcal{D})\, d\theta = E_{P(\theta \mid \mathcal{D})}[P(x \mid \theta)] \;\Rightarrow\; P(x = 1 \mid \mathcal{D}) = E_{P(\theta \mid \mathcal{D})}[\theta] = \dfrac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$

SLIDE 21

Dirichlet distribution

$P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \dfrac{\Gamma(\alpha)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad \alpha = \sum_{k=1}^{K} \alpha_k$

β€’ Mean and mode:

$E[\theta_k] = \dfrac{\alpha_k}{\alpha}, \qquad \hat{\theta}_k = \dfrac{\alpha_k - 1}{\alpha - K}$

β€’ Input space: $\boldsymbol{\theta} = [\theta_1, \dots, \theta_K]^T$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$

SLIDE 22

Dirichlet distribution: Examples

β€’ $\boldsymbol{\alpha} = [0.1, 0.1, 0.1]$, $\boldsymbol{\alpha} = [1, 1, 1]$, $\boldsymbol{\alpha} = [10, 10, 10]$
β€’ The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples)

SLIDE 23

Dirichlet distribution: Example

β€’ $\boldsymbol{\alpha} = [20, 2, 2]$, $\boldsymbol{\alpha} = [2, 2, 2]$

SLIDE 24

Multinomial distribution: Prior

β€’ The Dirichlet distribution is the conjugate prior of the multinomial:

$P(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) \;\propto\; P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \;\propto\; \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$

$P(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha} + \boldsymbol{m}), \qquad \boldsymbol{m} = [m_1, \dots, m_K]^T\ \ (\text{the sufficient statistics of the data})$

β€’ Prior: $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$; posterior: $\boldsymbol{\theta} \mid \mathcal{D} \sim \mathrm{Dir}(\alpha_1 + m_1, \dots, \alpha_K + m_K)$

SLIDE 25

Multinomial: Predictive distribution

β€’ Training samples: $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$

$P(\boldsymbol{\theta}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \alpha_1, \dots, \alpha_K), \qquad P(\boldsymbol{\theta} \mid \mathcal{D}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \alpha_1 + m_1, \dots, \alpha_K + m_K)$

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = E_{P(\boldsymbol{\theta} \mid \mathcal{D})}[P(\boldsymbol{x} \mid \boldsymbol{\theta})] \;\Rightarrow\; P(x_k = 1 \mid \mathcal{D}) = E_{P(\boldsymbol{\theta} \mid \mathcal{D})}[\theta_k] = \dfrac{\alpha_k + m_k}{\alpha + N}$

β€’ A larger $\alpha$ means more confidence in our prior:

$\dfrac{\alpha_k + m_k}{\alpha + N} = \dfrac{\alpha}{\alpha + N} \times \dfrac{\alpha_k}{\alpha} + \dfrac{N}{\alpha + N} \times \dfrac{m_k}{N}$

SLIDE 26

Multinomial: Predictive distribution

$P(\boldsymbol{x} \mid \mathcal{D}) = \mathrm{Mult}\!\left(\dfrac{\alpha_1 + m_1}{\alpha + N}, \dots, \dfrac{\alpha_K + m_K}{\alpha + N}\right)$

β€’ Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples
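A minimal sketch of the Dirichlet-multinomial update and predictive distribution described on slides 24 to 26 (illustrative code, not part of the deck; integer state labels are an assumed encoding):

```python
import numpy as np

def dirichlet_predictive(labels, alpha):
    """Posterior Dir(alpha + m) and predictive (alpha_k + m_k) / (sum(alpha) + N)."""
    alpha = np.asarray(alpha, dtype=float)
    m = np.bincount(labels, minlength=len(alpha))     # counts m_k
    posterior = alpha + m                             # Dir(alpha_1 + m_1, ..., alpha_K + m_K)
    predictive = posterior / posterior.sum()
    return posterior, predictive

post, pred = dirichlet_predictive([0, 0, 2, 1, 0], alpha=[1, 1, 1])
print(post, pred)   # [4. 2. 2.] [0.5 0.25 0.25]
```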

SLIDE 27

Example: MLE vs. Bayesian

β€’ Effect of different priors on smoothing the parameter estimates: $\theta \sim \mathrm{Beta}(1,1)$ vs. $\theta \sim \mathrm{Beta}(10,10)$ [Koller & Friedman Book]

SLIDE 28

Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)

β€’ Likelihood $P(x \mid \mu) = \mathcal{N}(\mu, \sigma^2)$, prior $P(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$

$P(\mu \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid \mu)\, P(\mu) = \left[\prod_{n=1}^{N} P(x^{(n)} \mid \mu)\right] P(\mu)$

$P(\mu \mid \mathcal{D}) \;\propto\; \exp\left\{ -\dfrac{1}{2}\left[ \left(\dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\dfrac{\sum_{n=1}^{N} x^{(n)}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}\right)\mu \right] \right\}$

SLIDE 29

Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)

$\Rightarrow\; P(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma_N^2)$

$\mu_N = \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \cdot \dfrac{\sum_{n=1}^{N} x^{(n)}}{N} + \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \dfrac{1}{\sigma_N^2} = \dfrac{1}{\sigma_0^2} + \dfrac{N}{\sigma^2}$

β€’ Predictive distribution:

$P(x \mid \mathcal{D}) = \int P(x \mid \mu)\, P(\mu \mid \mathcal{D})\, d\mu \;\propto\; \exp\left\{ -\dfrac{1}{2}\, \dfrac{(x - \mu_N)^2}{\sigma^2 + \sigma_N^2} \right\} \;\Rightarrow\; P(x \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2)$
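A minimal numeric sketch of the posterior and predictive formulas above (illustrative; it assumes a known noise variance sigma2 and a prior N(mu0, sigma0_2)):

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0_2, sigma2):
    """Posterior N(mu_N, sigma_N^2) over the mean, and the predictive variance, for known sigma^2."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = (N * sigma0_2 * x.mean() + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    return mu_N, sigma_N2, sigma2 + sigma_N2         # predictive variance = sigma^2 + sigma_N^2

print(gaussian_posterior([1.2, 0.8, 1.1], mu0=0.0, sigma0_2=1.0, sigma2=0.25))
```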

SLIDE 30

Conjugate priors for the Gaussian distribution

β€’ Known $\mu$, unknown $\sigma^2$:
  β€’ The conjugate prior for the precision $\lambda = 1/\sigma^2$ is a Gamma distribution with shape $a_0$ and rate (inverse scale) $b_0$
  β€’ The conjugate prior for $\sigma^2$ is an Inverse-Gamma distribution
β€’ Unknown $\mu$ and unknown $\sigma^2$:
  β€’ The conjugate prior $P(\mu, \lambda)$ is Normal-Gamma
β€’ Multivariate case:
  β€’ The conjugate prior $P(\boldsymbol{\mu}, \boldsymbol{\Lambda})$ is Normal-Wishart

$P(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\, \mathrm{Gam}(\lambda \mid a_0, b_0)$

$\mathrm{Gam}(\lambda \mid a, b) = \dfrac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} \exp(-b\lambda), \qquad \mathrm{InvGam}(\sigma^2 \mid a, b) = \dfrac{1}{\Gamma(a)}\, b^{a} (\sigma^2)^{-a-1} \exp\!\left(-\dfrac{b}{\sigma^2}\right)$

SLIDE 31

Bayesian estimation: Summary

β€’ The Bayesian approach treats parameters as random variables
  β€’ Learning is then a special case of inference
β€’ It is asymptotically equivalent to MLE
β€’ For some parametric distributions, it has a closed form (obtained from the prior parameters and the sufficient statistics of the data)

SLIDE 32

Learning in Bayesian networks

β€’ Learning:
  β€’ MLE
    β€’ The likelihood decomposes over the nodes' conditional distributions
  β€’ Bayesian
    β€’ We can make some assumptions (global and local parameter independence) to simplify learning

SLIDE 33

Likelihood decomposition: Example

β€’ Directed factorization causes the likelihood to decompose locally; for the network $X_1 \rightarrow X_2$, $X_1 \rightarrow X_3$, $\{X_2, X_3\} \rightarrow X_4$:

$P(\boldsymbol{X} \mid \boldsymbol{\theta}) = P(X_1 \mid \boldsymbol{\theta}_1)\, P(X_2 \mid X_1, \boldsymbol{\theta}_2)\, P(X_3 \mid X_1, \boldsymbol{\theta}_3)\, P(X_4 \mid X_2, X_3, \boldsymbol{\theta}_4)$

SLIDE 34

Decomposition of the likelihood

β€’ If we assume the parameters of each CPD are disjoint (i.e., disjoint parameter sets $\boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$ for different variables) and the data is complete, then the likelihood function decomposes over the nodes:

$L(\boldsymbol{\theta}; \mathcal{D}) = P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{i \in \mathcal{V}} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right) = \prod_{i \in \mathcal{V}} \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right) = \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})$

$L_i(\boldsymbol{\theta}_i; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right), \qquad \boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$

β€’ Hence, with $\boldsymbol{\theta} = [\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_{|\mathcal{V}|}]$, maximizing $L(\boldsymbol{\theta}; \mathcal{D})$ reduces to maximizing each local term separately:

$\hat{\boldsymbol{\theta}}_i = \underset{\boldsymbol{\theta}_i}{\operatorname{argmax}}\, L_i(\boldsymbol{\theta}_i; \mathcal{D}), \qquad \hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, L(\boldsymbol{\theta}; \mathcal{D})$

SLIDE 35

Local decomposition of the likelihood: Table CPDs

β€’ The choices of parameters given different value assignments to the parents are independent of each other; consider only the data points that agree with the parent assignment:

$L_i(\boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}; \mathcal{D}) = \prod_{n=1}^{N} \theta_{X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}} = \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})}\ \prod_{x \in \mathrm{Val}(X_i)} \theta_{x \mid \boldsymbol{v}}^{\ \sum_{n=1}^{N} I\left(X_i^{(n)} = x,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}$

SLIDE 36

Local decomposition of the likelihood: Table CPDs

β€’ Define the counts $m_i(k, \boldsymbol{v}) = \sum_{n=1}^{N} I\!\left(X_i^{(n)} = k,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)$, with the constraint $\forall \boldsymbol{v}: \sum_{k} \theta^{i}_{k \mid \boldsymbol{v}} = 1$
β€’ MLE then gives, for each parent assignment $\boldsymbol{v}$:

$\hat{\theta}^{i}_{k \mid \boldsymbol{v}} = \dfrac{m_i(k, \boldsymbol{v})}{\sum_{k'} m_i(k', \boldsymbol{v})} = \dfrac{\sum_{n=1}^{N} I\!\left(X_i^{(n)} = k,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}{\sum_{n=1}^{N} I\!\left(\mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}$

β€’ Table CPD $\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$: one row $\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} = (\theta^{i}_{1 \mid \boldsymbol{v}}, \dots, \theta^{i}_{K \mid \boldsymbol{v}})$ per parent assignment $\boldsymbol{v} \in \{\boldsymbol{v}_1, \dots, \boldsymbol{v}_M\}$ and one column per value $X_i = 1, \dots, K$; each row is a multinomial distribution

SLIDE 37

Overfitting

β€’ For a large number of parent configurations $|\mathrm{Val}(\mathrm{Pa}_{X_i})| = \prod_{A \in \mathrm{Pa}_{X_i}} |\mathrm{Val}(A)|$, most "buckets" will have very few instances
  β€’ The number of parameters increases exponentially with $|\mathrm{Pa}_{X_i}|$, giving poor estimates
  β€’ With limited data, overfitting occurs
β€’ We often get better generalization with simpler structures
β€’ Zero-count problem (the sparse-data problem)
  β€’ e.g., consider a naΓ―ve Bayes classifier and an email $e$ containing a word that does not occur in any training document for class $c$; then $P(e \mid c) = 0$

SLIDE 38

ML example: NaΓ―ve Bayes classifier

β€’ Plate model: class $Y^{(n)}$ with prior parameters $\boldsymbol{\pi}$, and features $X_j^{(n)}$, $j = 1, \dots, D$, with parameters $\boldsymbol{\theta}_j$, for $n = 1, \dots, N$; equivalently, $Y \rightarrow X_1, \dots, X_D$ with CPD parameters $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_D$ (the training and test phases share $\boldsymbol{\pi}$ and the $\boldsymbol{\theta}_j$)
β€’ $\boldsymbol{\theta}_j = \boldsymbol{\theta}_{X_j \mid Y} = \left(\boldsymbol{\theta}_{X_j \mid 1}, \dots, \boldsymbol{\theta}_{X_j \mid C}\right)$, where $Y \in \{1, \dots, C\}$

SLIDE 39

ML example: NaΓ―ve Bayes classifier

$L(\boldsymbol{\theta}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(Y^{(n)} \mid \boldsymbol{\pi}\right) \prod_{j=1}^{D} P\!\left(X_j^{(n)} \mid Y^{(n)}, \boldsymbol{\theta}_{X_j \mid Y}\right)$

β€’ Class prior: $\hat{\pi}_c = \dfrac{\sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)}{N}$

β€’ Discrete inputs or features ($X_j \in \{1, \dots, K\}$), as an estimate of $P(X_j = k \mid Y = c)$:

$\hat{\theta}^{j}_{k \mid c} = \dfrac{\sum_{n=1}^{N} I\!\left(X_j^{(n)} = k,\ Y^{(n)} = c\right)}{\sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)}$

SLIDE 40

Recall: Maximum conditional likelihood

β€’ Example: a discriminative classifier
  β€’ It needs to learn the conditional distribution (not the joint distribution)
β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$ and a parametric conditional distribution $P(y \mid \boldsymbol{x}; \boldsymbol{\theta})$, we find:

$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}; \boldsymbol{\theta}\right)$

β€’ We will also see maximum conditional likelihood for more general CPDs than tabular ones

SLIDE 41

Maximum conditional likelihood: Linear regression

β€’ Model: $P(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathcal{N}\!\left(y \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}),\ \sigma^2\right)$; linear case: $\boldsymbol{\phi}(\boldsymbol{x}) = [1, x_1, \dots, x_D]$
β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$, the conditional likelihood is:

$L_{Y \mid \boldsymbol{X}}(\boldsymbol{w}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}, \boldsymbol{w}\right) = \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}),\ \sigma^2\right)$

$\hat{\boldsymbol{w}} = \underset{\boldsymbol{w}}{\operatorname{argmax}} \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}),\ \sigma^2\right) = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{y}$

SLIDE 42

Maximum conditional likelihood: Logistic regression

β€’ Model: $P(y = 1 \mid \boldsymbol{x}) = \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x})\right)$, with $y \in \{0, 1\}$ and $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$; linear LR: $\boldsymbol{\phi}(\boldsymbol{x}) = [1, x_1, \dots, x_D]$

$P(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}\!\left(y \mid \sigma(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}))\right)$

β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$, the conditional likelihood is:

$L_{Y \mid \boldsymbol{X}}(\boldsymbol{w}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}, \boldsymbol{w}\right) = \prod_{n=1}^{N} \mathrm{Ber}\!\left(y^{(n)} \mid \sigma(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}))\right) = \prod_{n=1}^{N} \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)})\right)^{y^{(n)}} \left(1 - \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)})\right)\right)^{1 - y^{(n)}}$

SLIDE 43

MLE for Bayesian networks: Summary

β€’ For a BN with disjoint sets of parameters in the CPDs, the likelihood decomposes as a product of local likelihood functions, one per node
  β€’ Thus, we can optimize the likelihoods of different nodes separately
β€’ For table CPDs, the local likelihood further decomposes as a product of likelihoods of multinomials, one for each parent combination
β€’ MLE suffers from the sparse-data problem

SLIDE 44

Bayesian learning for BNs: Global parameter independence

β€’ Global parameter independence assumption ($\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$):

$P(\boldsymbol{\theta}) = \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i)$

β€’ With complete data, the likelihood decomposes as $P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})$ with $L_i(\boldsymbol{\theta}_i; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right)$, so:

$P(\boldsymbol{\theta} \mid \mathcal{D}) = \dfrac{1}{P(\mathcal{D})} P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta}) = \dfrac{1}{P(\mathcal{D})} \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})\, P(\boldsymbol{\theta}_i) = \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i \mid \mathcal{D})$

β€’ $\Rightarrow$ The posteriors of the $\boldsymbol{\theta}_i$ are independent given complete data and independent priors (in the meta-network, each $\boldsymbol{\theta}_i$ is the only extra parent of its $X_i^{(n)}$ nodes, $n = 1, \dots, N$)

SLIDE 45

Bayesian learning for BNs: Global parameter independence

β€’ How can we read this independence directly from the structure of the Bayesian network (the meta-network over data and parameter variables) when $P(\boldsymbol{\theta})$ satisfies global parameter independence?

SLIDE 46

Bayesian learning for BNs: Global parameter independence

β€’ Predictive distribution (instances are independent given the parameters for complete data; $\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$):

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x} \mid \boldsymbol{\theta}, \mathcal{D})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int \prod_{i \in \mathcal{V}} P\!\left(x_i \mid pa_{X_i}, \boldsymbol{\theta}_i\right) \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i \mid \mathcal{D})\, d\boldsymbol{\theta} = \prod_{i \in \mathcal{V}} \int P\!\left(x_i \mid pa_{X_i}, \boldsymbol{\theta}_i\right) P(\boldsymbol{\theta}_i \mid \mathcal{D})\, d\boldsymbol{\theta}_i$

(using $\int\!\!\int f(x)\, g(y)\, dx\, dy = \int f(x)\, dx \int g(y)\, dy$)

β€’ So we can solve the prediction problem for each CPD independently and then combine the results

SLIDE 47

Local decomposition for table CPDs

β€’ A prior $P(\boldsymbol{\theta}_{X \mid \mathrm{Pa}_X})$ satisfies the local parameter independence assumption if:

$P(\boldsymbol{\theta}_{X \mid \mathrm{Pa}_X}) = \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_X)} P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}})$

β€’ The posteriors $P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}} \mid \mathcal{D})$ and $P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}'} \mid \mathcal{D})$ are then independent despite the v-structure on $X$, because $\mathrm{Pa}_X$ acts like a multiplexer
β€’ In other words, the groups of parameters in different rows of a CPD are independent (in the meta-network, the parent $Y^{(n)} \in \{1, \dots, C\}$ selects which of $\boldsymbol{\theta}_{X \mid 1}, \dots, \boldsymbol{\theta}_{X \mid C}$ generates $X^{(n)}$, $n = 1, \dots, N$)

SLIDE 48

Global and local parameter independence

β€’ Let the data be complete and the CPDs be tabular. If the prior $P(\boldsymbol{\theta})$ satisfies global and local parameter independence, then:

$P(\boldsymbol{\theta} \mid \mathcal{D}) = \prod_{i \in \mathcal{V}}\ \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})} P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \mid \mathcal{D}\right)$

β€’ With Dirichlet priors on each row ($K = |\mathrm{Val}(X_i)|$, $\mathrm{Pa}_{X_i} = \boldsymbol{v}$):

$\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \sim \mathrm{Dir}\!\left(\alpha^{i}_{1 \mid \boldsymbol{v}}, \dots, \alpha^{i}_{K \mid \boldsymbol{v}}\right) \;\Rightarrow\; \boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \mid \mathcal{D} \sim \mathrm{Dir}\!\left(\alpha^{i}_{1 \mid \boldsymbol{v}} + m_i(1, \boldsymbol{v}), \dots, \alpha^{i}_{K \mid \boldsymbol{v}} + m_i(K, \boldsymbol{v})\right)$

$P\!\left(X_i = k \mid \boldsymbol{v}, \mathcal{D}\right) = \dfrac{\alpha^{i}_{k \mid \boldsymbol{v}} + m_i(k, \boldsymbol{v})}{\sum_{k'=1}^{K} \left(\alpha^{i}_{k' \mid \boldsymbol{v}} + m_i(k', \boldsymbol{v})\right)}$
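Extending the earlier counting sketch, a minimal illustration of the Dirichlet-smoothed row estimates above (illustrative code; the uniform pseudo-count alpha = 1 is an assumption):

```python
from collections import Counter

def bayes_table_cpd(data, child, parents, K, alpha=1.0):
    """Predictive P(X_i = k | v, D) = (alpha + m(k, v)) / (K*alpha + m(v)) for each seen v."""
    joint, parent_counts = Counter(), Counter()
    for sample in data:
        v = tuple(sample[p] for p in parents)
        joint[(sample[child], v)] += 1                 # m(k, v)
        parent_counts[v] += 1                          # m(v)
    return {(k, v): (alpha + joint[(k, v)]) / (K * alpha + parent_counts[v])
            for v in parent_counts for k in range(K)}

data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 1}, {"X1": 0, "X2": 0}, {"X1": 1, "X2": 1}]
print(bayes_table_cpd(data, child="X2", parents=["X1"], K=2))
```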

SLIDE 49

Priors for Bayesian learning

β€’ The priors $\boldsymbol{\alpha}_{X_i \mid \mathrm{Pa}_{X_i}}$ must be defined: $\left\{\alpha_{x \mid \boldsymbol{v}} : x \in \mathrm{Val}(X_i),\ \boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})\right\}$
β€’ K2 prior: a fixed value for all hyperparameters, $\alpha_{x \mid \boldsymbol{v}} = 1$
β€’ Alternatively, we can store a single $\alpha$ (an equivalent sample size) and a prior distribution $P'$ on the variables in the network: $\boldsymbol{\alpha}_{X_i \mid \mathrm{Pa}_{X_i}} = \alpha\, P'(X_i, \mathrm{Pa}_{X_i})$
β€’ BDe prior: define $P'$ as a set of independent marginals over the $X_i$'s

SLIDE 50

NaΓ―ve Bayes example: Bayesian learning

β€’ Plate model (training phase): class $Y^{(n)}$ with parameters $\boldsymbol{\pi}$ and hyperparameters $\boldsymbol{\alpha}$; features $X_j^{(n)}$, $j = 1, \dots, D$, with parameters $\boldsymbol{\theta}_{X_j \mid c}$ and hyperparameters $\boldsymbol{\alpha}_{X_j \mid c}$, $c = 1, \dots, C$; samples $n = 1, \dots, N$
β€’ $\boldsymbol{\theta}_j = \boldsymbol{\theta}_{X_j \mid Y} = \left(\boldsymbol{\theta}_{X_j \mid 1}, \dots, \boldsymbol{\theta}_{X_j \mid C}\right)$, where $Y \in \{1, \dots, C\}$

SLIDE 51

NaΓ―ve Bayes example: Bayesian learning

β€’ Class variable: $Y \sim \mathrm{Mult}(\boldsymbol{\pi})$ (i.e., $P(Y = c \mid \boldsymbol{\pi}) = \pi_c$), with prior $P(\boldsymbol{\pi} \mid \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1, \dots, \alpha_C)$ and posterior $P(\boldsymbol{\pi} \mid \mathcal{D}, \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1 + m_1, \dots, \alpha_C + m_C)$:

$P(Y = c \mid \mathcal{D}) = \dfrac{\alpha_c + m_c}{\sum_{c'=1}^{C} \left(\alpha_{c'} + m_{c'}\right)}, \qquad m_c = \sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)$

β€’ Discrete inputs or features ($X_j \in \{1, \dots, K\}$):

$P(X_j = k \mid Y = c, \mathcal{D}) = \dfrac{\alpha^{j}_{k \mid c} + m_j(k, c)}{\sum_{k'=1}^{K} \left(\alpha^{j}_{k' \mid c} + m_j(k', c)\right)}, \qquad m_j(k, c) = \sum_{n=1}^{N} I\!\left(X_j^{(n)} = k,\ Y^{(n)} = c\right)$

SLIDE 52

Global & local independence

β€’ For nodes with no parents, the parameters define a single distribution
  β€’ Bayesian or ML learning can be used as in simple density estimation on a single variable
β€’ More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of parent values
  β€’ The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration
  β€’ Apply independent Bayesian or ML learning to each
slide-53
SLIDE 53

Shared parameters

54

 Sharing CPDs or sharing parameters within a single CPD:

𝑄 π‘Œπ‘— π‘„π‘π‘Œπ‘—, 𝜾 = 𝑄 π‘Œ

π‘˜ π‘„π‘π‘Œπ‘˜, 𝜾

𝑄 π‘Œπ‘— 𝒗1, πœ„π‘—

𝑑 = 𝑄 π‘Œ π‘˜ 𝒗2, πœ„π‘— 𝑑

 MLE: For networks with shared CPDs, sufficient statistics

accumulate over all uses of CPD

 Aggregate sufficient statistics: collect all the instances generated

from the same conditional distribution and combine sufficient statistics from them

slide-54
SLIDE 54

Markov chain

55

𝑄 π‘Œ1, … , π‘Œπ‘ˆ 𝜾 = 𝑄 π‘Œ1 𝝆

𝑒=2 π‘ˆ

𝑄(π‘Œπ‘’|π‘Œπ‘’βˆ’1, π‘©π‘’βˆ’1,𝑒)

 Shared Parameters:

𝑄 π‘Œ1, … , π‘Œπ‘ˆ 𝜾 = 𝑄 π‘Œ1 𝝆

𝑒=2 π‘ˆ

𝑄(π‘Œπ‘’|π‘Œπ‘’βˆ’1, 𝑩)

π‘Œ1 π‘Œ2 π‘Œπ‘ˆ π‘Œπ‘ˆβˆ’1 …

slide-55
SLIDE 55

Markov chain

56

 Initial state probability:

πœŒπ‘— = 𝑄 π‘Œ1 = 𝑗 , 1 ≀ 𝑗 ≀ 𝐿

 State transition probability:

π΅π‘˜π‘— = πœ„

π‘˜|𝑗 = 𝑄 π‘Œπ‘’+1 = 𝑗 π‘Œπ‘’ = π‘˜ ,

1 ≀ 𝑗, π‘˜ ≀ 𝐿

𝑗=1 𝐿

π΅π‘˜π‘— = 1

π‘Œ1 π‘Œ2 π‘Œπ‘ˆ π‘Œπ‘ˆβˆ’1 … 𝝆 𝑩 = πœΎπ‘‡β€²|𝑇

SLIDE 56

Markov chain: Parameter learning by MLE

$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P\!\left(X_1^{(n)} \mid \boldsymbol{\pi}\right) \prod_{t=2}^{T} P\!\left(X_t^{(n)} \mid X_{t-1}^{(n)}, \boldsymbol{A}\right)$

$\hat{A}_{kj} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k,\ X_t^{(n)} = j\right)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k\right)}, \qquad \hat{\pi}_j = \dfrac{\sum_{n=1}^{N} I\!\left(X_1^{(n)} = j\right)}{N}$

SLIDE 57

Markov chain: Bayesian learning

β€’ Assign a Dirichlet prior to each row of the transition matrix $\boldsymbol{A}$, where $A_{kj} = P(X_{t+1} = j \mid X_t = k)$:

$\boldsymbol{A}_{k,\cdot} \sim \mathrm{Dir}\!\left(\alpha_{k,1}, \dots, \alpha_{k,K}\right)$

$P\!\left(X_{t+1} = j \mid X_t = k,\ \mathcal{D},\ \boldsymbol{\alpha}_{k,\cdot}\right) = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k,\ X_t^{(n)} = j\right) + \alpha_{k,j}}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k\right) + \sum_{j'=1}^{K} \alpha_{k,j'}}$
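A minimal sketch covering both slides 56 and 57: transition counts give the ML estimate, and adding Dirichlet pseudo-counts gives the Bayesian (smoothed) estimate (illustrative code; the uniform pseudo-count alpha = 1 is an assumption):

```python
import numpy as np

def markov_chain_estimates(sequences, K, alpha=1.0):
    """pi_j and A_kj by MLE, plus the Dirichlet-smoothed transition estimate."""
    pi_counts = np.zeros(K)
    trans = np.zeros((K, K))                       # counts of (X_{t-1}=k, X_t=j)
    for seq in sequences:
        pi_counts[seq[0]] += 1
        for prev, cur in zip(seq[:-1], seq[1:]):
            trans[prev, cur] += 1
    pi_ml = pi_counts / pi_counts.sum()
    A_ml = trans / trans.sum(axis=1, keepdims=True)
    A_bayes = (trans + alpha) / (trans + alpha).sum(axis=1, keepdims=True)
    return pi_ml, A_ml, A_bayes

seqs = [[0, 1, 1, 2, 0], [0, 0, 1, 2, 2]]
pi_ml, A_ml, A_bayes = markov_chain_estimates(seqs, K=3)
print(A_ml[0], A_bayes[0])
```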

SLIDE 58

Hyper-parameters in the Bayesian approach

β€’ So far we have considered either parameter independencies or parameter sharing
β€’ Hierarchical priors give us a flexible language for introducing dependencies in the priors over parameters
  β€’ Useful when we have a small number of examples relevant to each parameter, but we believe that some parameters are reasonably similar
  β€’ In such situations, hierarchical priors "spread" the effect of the observations between parameters with shared hyper-parameters
β€’ However, when we use hyper-parameters we transform our learning problem into one that includes a hidden variable

SLIDE 59

Hyper-parameters: Example

β€’ [Koller & Friedman Book] The effect of the prior is to shift both $\theta_{a \mid x^0}$ and $\theta_{a \mid x^1}$ to be closer to each other

SLIDE 60

Hyper-parameters: Example

β€’ [Koller & Friedman Book] This is useful if we believe that two variables Y and Z depend on X in a similar (but not identical) way

SLIDE 61

References

β€’ D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009. Chapter 17.