SLIDE 1

Learning Bayesian networks:

Given structure and completely observed data

Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

SLIDE 2

Learning problem

β€’ Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = (\mathcal{G}^*, \boldsymbol{\theta}^*)$
β€’ Hypothesis space: a specified family of probabilistic graphical models
β€’ Data: a set of instances sampled from $P^*$
β€’ Learning goal: selecting a model $\mathcal{M}$ that constructs the best approximation to $\mathcal{M}^*$ according to a performance metric

SLIDE 3

Learning tasks on graphical models

β€’ Parameter learning / structure learning
β€’ Completely observed / partially observed data
β€’ Directed models / undirected models

SLIDE 4

Parameter learning in directed models: Complete data

β€’ We assume that the structure of the model is known
  β€’ We consider learning the parameters of a BN with a given structure
β€’ Goal: estimate the CPDs from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
β€’ Each training sample $\boldsymbol{x}^{(n)} = \left(x_1^{(n)}, \dots, x_L^{(n)}\right)$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables)

SLIDE 5

Density estimation review

β€’ We use density estimation to solve this learning problem
β€’ Density estimation: estimating the probability density function $P(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$ drawn from it
β€’ Parametric methods: assume $P(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters
  β€’ MLE and Bayesian estimation
  β€’ MLE: need to estimate $\boldsymbol{\theta}^*$ given $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$
    β€’ MLE suffers from overfitting
  β€’ Bayesian estimation: a probability distribution $P(\boldsymbol{\theta})$ over the spectrum of hypotheses
    β€’ Needs a prior distribution on the parameters

SLIDE 6

Density estimation: Graphical model

β€’ i.i.d. assumption: as a plate model, the single parameter node $\boldsymbol{\theta}$ is the shared parent of every sample node $X^{(n)}$, $n = 1, \dots, N$ (equivalently, of the unrolled nodes $X^{(1)}, X^{(2)}, \dots, X^{(N)}$)
β€’ In the Bayesian view, $\boldsymbol{\theta}$ itself has a parent node of hyperparameters $\boldsymbol{\alpha}$

SLIDE 7

Maximum Likelihood Estimation (MLE)

β€’ The likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$
β€’ Assuming i.i.d. (independent, identically distributed) samples, the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples is

$P(\mathcal{D} \mid \boldsymbol{\theta}) = P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta})$

β€’ Maximum likelihood estimation:

$\hat{\boldsymbol{\theta}}_{ML} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\mathcal{D} \mid \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{n=1}^{N} \ln P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta})$

β€’ MLE has a closed-form solution for many parametric distributions

SLIDE 8

MLE: Bernoulli distribution

β€’ Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), and $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$ (so $P(x = 1 \mid \theta) = \theta$):

$P(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} P(x^{(n)} \mid \theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$

$\ln P(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \ln P(x^{(n)} \mid \theta) = \sum_{n=1}^{N} \left\{ x^{(n)} \ln\theta + (1 - x^{(n)}) \ln(1-\theta) \right\}$

$\dfrac{\partial \ln P(\mathcal{D} \mid \theta)}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \dfrac{\sum_{n=1}^{N} x^{(n)}}{N} = \dfrac{m}{N}$

SLIDE 9

MLE: Multinomial distribution

β€’ Multinomial distribution (on a variable with $K$ states):

$P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$

β€’ Parameter space: $\boldsymbol{\theta} = [\theta_1, \dots, \theta_K]$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$; e.g., for $K = 3$, the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ with $\theta_k \in [0,1]$ gives a simplex containing the set of valid parameters
β€’ Variable: 1-of-K coding, $\boldsymbol{x} = [x_1, \dots, x_K]$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$, so that $P(x_k = 1) = \theta_k$

SLIDE 10

MLE: Multinomial distribution

β€’ Given $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:

$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{\sum_{n=1}^{N} x_k^{(n)}}$

β€’ Maximizing the log-likelihood subject to $\sum_k \theta_k = 1$ via a Lagrange multiplier:

$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln P(\mathcal{D} \mid \boldsymbol{\theta}) + \lambda \Big(1 - \sum_{k=1}^{K} \theta_k\Big) \;\Rightarrow\; \hat{\theta}_k = \dfrac{\sum_{n=1}^{N} x_k^{(n)}}{N} = \dfrac{m_k}{N}$

where $m_k = \sum_{n=1}^{N} x_k^{(n)}$ and $\sum_{k=1}^{K} m_k = N$
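The closed form above is just normalized counts; a minimal sketch (illustrative, not from the slides), with integer state labels as the assumed input encoding:

```python
import numpy as np

def multinomial_mle(labels, K):
    """ML estimate theta_k = m_k / N from state labels in {0, ..., K-1}."""
    counts = np.bincount(labels, minlength=K)   # m_k
    return counts / counts.sum()                # m_k / N

print(multinomial_mle([0, 2, 2, 1, 2, 0], K=3))   # [1/3, 1/6, 1/2]
```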

SLIDE 11

MLE: Gaussian with unknown $\mu$

$\ln P(x^{(n)} \mid \mu) = -\dfrac{1}{2}\ln(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\big(x^{(n)} - \mu\big)^2$

$\dfrac{\partial \ln P(\mathcal{D} \mid \mu)}{\partial \mu} = 0 \;\Rightarrow\; \dfrac{\partial}{\partial \mu} \sum_{n=1}^{N} \ln P(x^{(n)} \mid \mu) = 0 \;\Rightarrow\; \sum_{n=1}^{N} \dfrac{1}{\sigma^2}\big(x^{(n)} - \mu\big) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{n=1}^{N} x^{(n)}$

SLIDE 12

Bayesian approach

β€’ Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori distribution
  β€’ This utilizes the available prior information about the unknown parameters
  β€’ As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$
β€’ Samples $\mathcal{D}$ convert the prior density $P(\boldsymbol{\theta})$ into a posterior density $P(\boldsymbol{\theta} \mid \mathcal{D})$
β€’ We keep track of beliefs about $\boldsymbol{\theta}$'s values and use these beliefs to reach conclusions

SLIDE 13

Maximum A Posteriori (MAP) estimation

β€’ MAP estimation:

$\hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\boldsymbol{\theta} \mid \mathcal{D})$

β€’ Since $P(\boldsymbol{\theta} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$:

$\hat{\boldsymbol{\theta}}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$

β€’ Example of a prior distribution: $P(\theta) = \mathcal{N}(\theta_0, \sigma^2)$

SLIDE 14

Bayesian approach: Predictive distribution

β€’ Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$, a prior distribution on the parameters $P(\boldsymbol{\theta})$, and the form of the distribution $P(\boldsymbol{x} \mid \boldsymbol{\theta})$
β€’ We find $P(\boldsymbol{\theta} \mid \mathcal{D})$ and use it to specify the predictive distribution $P(\boldsymbol{x} \mid \mathcal{D})$ on new data as an estimate of $P(\boldsymbol{x})$:

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x}, \boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \mathcal{D}, \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}$

β€’ Analytical solutions exist only for very special forms of the involved functions

SLIDE 15

Conjugate priors

β€’ We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
β€’ Choose a prior such that the posterior distribution, which is proportional to $P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$, has the same functional form as the prior:

$\forall \boldsymbol{\alpha}, \mathcal{D}\ \ \exists \boldsymbol{\alpha}'\ :\ P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}') \propto P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$

where $P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}')$ and $P(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$ have the same functional form

SLIDE 16

Prior for Bernoulli likelihood

β€’ Beta distribution over $\theta \in [0,1]$:

$\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) = \dfrac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \;\propto\; \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$

β€’ The Beta distribution is the conjugate prior of the Bernoulli likelihood $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$
β€’ Mean and mode:

$E[\theta] = \dfrac{\alpha_1}{\alpha_0 + \alpha_1}, \qquad \hat{\theta} = \dfrac{\alpha_1 - 1}{\alpha_0 - 1 + \alpha_1 - 1}\ \ (\text{most probable } \theta)$

SLIDE 17

Beta distribution

SLIDE 18

Bernoulli likelihood: Posterior

β€’ Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m = \sum_{n=1}^{N} x^{(n)}$ heads (1) and $N - m$ tails (0):

$P(\theta \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid \theta)\, P(\theta) = \left[\prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}\right] \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \;\propto\; \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1}$

$\Rightarrow\; P(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1', \alpha_0') \;\propto\; \theta^{\alpha_1' - 1}(1-\theta)^{\alpha_0' - 1}, \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$

SLIDE 19

Example

β€’ Bernoulli likelihood $P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$, prior $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0)$ with $\alpha_0 = \alpha_1 = 2$
β€’ Data $\mathcal{D} = \{1, 1, 1\} \Rightarrow N = 3,\ m = 3$, so the posterior is $\mathrm{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = 5,\ \alpha_0' = 2$

$\hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\, P(\theta \mid \mathcal{D}) = \dfrac{\alpha_1' - 1}{\alpha_1' - 1 + \alpha_0' - 1} = \dfrac{4}{5}$

SLIDE 20

Bernoulli: Predictive distribution

β€’ Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ with $m$ ones

$P(\theta) = \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}, \qquad P(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1}(1-\theta)^{\alpha_0 + (N-m) - 1}$

$P(x \mid \mathcal{D}) = \int P(x \mid \theta)\, P(\theta \mid \mathcal{D})\, d\theta = E_{P(\theta \mid \mathcal{D})}[P(x \mid \theta)] \;\Rightarrow\; P(x = 1 \mid \mathcal{D}) = E_{P(\theta \mid \mathcal{D})}[\theta] = \dfrac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$

SLIDE 21

Dirichlet distribution

$P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \dfrac{\Gamma(\alpha)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad \alpha = \sum_{k=1}^{K} \alpha_k$

β€’ Mean and mode:

$E[\theta_k] = \dfrac{\alpha_k}{\alpha}, \qquad \hat{\theta}_k = \dfrac{\alpha_k - 1}{\alpha - K}$

β€’ Input space: $\boldsymbol{\theta} = [\theta_1, \dots, \theta_K]^T$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$

SLIDE 22

Dirichlet distribution: Examples

β€’ $\boldsymbol{\alpha} = [0.1, 0.1, 0.1]$, $\boldsymbol{\alpha} = [1, 1, 1]$, $\boldsymbol{\alpha} = [10, 10, 10]$
β€’ The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples)

SLIDE 23

Dirichlet distribution: Example

β€’ $\boldsymbol{\alpha} = [20, 2, 2]$, $\boldsymbol{\alpha} = [2, 2, 2]$

SLIDE 24

Multinomial distribution: Prior

β€’ The Dirichlet distribution is the conjugate prior of the multinomial:

$P(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) \;\propto\; P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \;\propto\; \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$

$P(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha} + \boldsymbol{m}), \qquad \boldsymbol{m} = [m_1, \dots, m_K]^T\ \ (\text{the sufficient statistics of the data})$

β€’ Prior: $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$; posterior: $\boldsymbol{\theta} \mid \mathcal{D} \sim \mathrm{Dir}(\alpha_1 + m_1, \dots, \alpha_K + m_K)$

SLIDE 25

Multinomial: Predictive distribution

β€’ Training samples: $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$

$P(\boldsymbol{\theta}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \alpha_1, \dots, \alpha_K), \qquad P(\boldsymbol{\theta} \mid \mathcal{D}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \alpha_1 + m_1, \dots, \alpha_K + m_K)$

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = E_{P(\boldsymbol{\theta} \mid \mathcal{D})}[P(\boldsymbol{x} \mid \boldsymbol{\theta})] \;\Rightarrow\; P(x_k = 1 \mid \mathcal{D}) = E_{P(\boldsymbol{\theta} \mid \mathcal{D})}[\theta_k] = \dfrac{\alpha_k + m_k}{\alpha + N}$

β€’ A larger $\alpha$ means more confidence in our prior:

$\dfrac{\alpha_k + m_k}{\alpha + N} = \dfrac{\alpha}{\alpha + N} \times \dfrac{\alpha_k}{\alpha} + \dfrac{N}{\alpha + N} \times \dfrac{m_k}{N}$

SLIDE 26

Multinomial: Predictive distribution

$P(\boldsymbol{x} \mid \mathcal{D}) = \mathrm{Mult}\!\left(\dfrac{\alpha_1 + m_1}{\alpha + N}, \dots, \dfrac{\alpha_K + m_K}{\alpha + N}\right)$

β€’ Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples
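A minimal sketch of the Dirichlet-multinomial update and predictive distribution described on slides 24 to 26 (illustrative code, not part of the deck; integer state labels are an assumed encoding):

```python
import numpy as np

def dirichlet_predictive(labels, alpha):
    """Posterior Dir(alpha + m) and predictive (alpha_k + m_k) / (sum(alpha) + N)."""
    alpha = np.asarray(alpha, dtype=float)
    m = np.bincount(labels, minlength=len(alpha))     # counts m_k
    posterior = alpha + m                             # Dir(alpha_1 + m_1, ..., alpha_K + m_K)
    predictive = posterior / posterior.sum()
    return posterior, predictive

post, pred = dirichlet_predictive([0, 0, 2, 1, 0], alpha=[1, 1, 1])
print(post, pred)   # [4. 2. 2.] [0.5 0.25 0.25]
```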

SLIDE 27

Example: MLE vs. Bayesian

β€’ Effect of different priors on smoothing the parameter estimates: $\theta \sim \mathrm{Beta}(1,1)$ vs. $\theta \sim \mathrm{Beta}(10,10)$ [Koller & Friedman Book]

SLIDE 28

Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)

β€’ Likelihood $P(x \mid \mu) = \mathcal{N}(\mu, \sigma^2)$, prior $P(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$

$P(\mu \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid \mu)\, P(\mu) = \left[\prod_{n=1}^{N} P(x^{(n)} \mid \mu)\right] P(\mu)$

$P(\mu \mid \mathcal{D}) \;\propto\; \exp\left\{ -\dfrac{1}{2}\left[ \left(\dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\dfrac{\sum_{n=1}^{N} x^{(n)}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}\right)\mu \right] \right\}$

SLIDE 29

Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)

$\Rightarrow\; P(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma_N^2)$

$\mu_N = \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \cdot \dfrac{\sum_{n=1}^{N} x^{(n)}}{N} + \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \dfrac{1}{\sigma_N^2} = \dfrac{1}{\sigma_0^2} + \dfrac{N}{\sigma^2}$

β€’ Predictive distribution:

$P(x \mid \mathcal{D}) = \int P(x \mid \mu)\, P(\mu \mid \mathcal{D})\, d\mu \;\propto\; \exp\left\{ -\dfrac{1}{2}\, \dfrac{(x - \mu_N)^2}{\sigma^2 + \sigma_N^2} \right\} \;\Rightarrow\; P(x \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2)$
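A minimal numeric sketch of the posterior and predictive formulas above (illustrative; it assumes a known noise variance sigma2 and a prior N(mu0, sigma0_2)):

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0_2, sigma2):
    """Posterior N(mu_N, sigma_N^2) over the mean, and the predictive variance, for known sigma^2."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = (N * sigma0_2 * x.mean() + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    return mu_N, sigma_N2, sigma2 + sigma_N2         # predictive variance = sigma^2 + sigma_N^2

print(gaussian_posterior([1.2, 0.8, 1.1], mu0=0.0, sigma0_2=1.0, sigma2=0.25))
```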

SLIDE 30

Conjugate priors for the Gaussian distribution

β€’ Known $\mu$, unknown $\sigma^2$:
  β€’ The conjugate prior for the precision $\lambda = 1/\sigma^2$ is a Gamma distribution with shape $a_0$ and rate (inverse scale) $b_0$
  β€’ The conjugate prior for $\sigma^2$ is an Inverse-Gamma distribution
β€’ Unknown $\mu$ and unknown $\sigma^2$:
  β€’ The conjugate prior $P(\mu, \lambda)$ is Normal-Gamma
β€’ Multivariate case:
  β€’ The conjugate prior $P(\boldsymbol{\mu}, \boldsymbol{\Lambda})$ is Normal-Wishart

$P(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\, \mathrm{Gam}(\lambda \mid a_0, b_0)$

$\mathrm{Gam}(\lambda \mid a, b) = \dfrac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} \exp(-b\lambda), \qquad \mathrm{InvGam}(\sigma^2 \mid a, b) = \dfrac{1}{\Gamma(a)}\, b^{a} (\sigma^2)^{-a-1} \exp\!\left(-\dfrac{b}{\sigma^2}\right)$

SLIDE 31

Bayesian estimation: Summary

β€’ The Bayesian approach treats parameters as random variables
  β€’ Learning is then a special case of inference
β€’ It is asymptotically equivalent to MLE
β€’ For some parametric distributions, it has a closed form (obtained from the prior parameters and the sufficient statistics of the data)

SLIDE 32

Learning in Bayesian networks

β€’ Learning:
  β€’ MLE
    β€’ The likelihood decomposes over the nodes' conditional distributions
  β€’ Bayesian
    β€’ We can make some assumptions (global and local parameter independence) to simplify learning

SLIDE 33

Likelihood decomposition: Example

β€’ Directed factorization causes the likelihood to decompose locally; for the network $X_1 \rightarrow X_2$, $X_1 \rightarrow X_3$, $\{X_2, X_3\} \rightarrow X_4$:

$P(\boldsymbol{X} \mid \boldsymbol{\theta}) = P(X_1 \mid \boldsymbol{\theta}_1)\, P(X_2 \mid X_1, \boldsymbol{\theta}_2)\, P(X_3 \mid X_1, \boldsymbol{\theta}_3)\, P(X_4 \mid X_2, X_3, \boldsymbol{\theta}_4)$

SLIDE 34

Decomposition of the likelihood

β€’ If we assume the parameters of each CPD are disjoint (i.e., disjoint parameter sets $\boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$ for different variables) and the data is complete, then the likelihood function decomposes over the nodes:

$L(\boldsymbol{\theta}; \mathcal{D}) = P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{i \in \mathcal{V}} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right) = \prod_{i \in \mathcal{V}} \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right) = \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})$

$L_i(\boldsymbol{\theta}_i; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right), \qquad \boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$

β€’ Hence, with $\boldsymbol{\theta} = [\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_{|\mathcal{V}|}]$, maximizing $L(\boldsymbol{\theta}; \mathcal{D})$ reduces to maximizing each local term separately:

$\hat{\boldsymbol{\theta}}_i = \underset{\boldsymbol{\theta}_i}{\operatorname{argmax}}\, L_i(\boldsymbol{\theta}_i; \mathcal{D}), \qquad \hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\, L(\boldsymbol{\theta}; \mathcal{D})$

SLIDE 35

Local decomposition of the likelihood: Table CPDs

β€’ The choices of parameters given different value assignments to the parents are independent of each other; consider only the data points that agree with the parent assignment:

$L_i(\boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}; \mathcal{D}) = \prod_{n=1}^{N} \theta_{X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}} = \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})}\ \prod_{x \in \mathrm{Val}(X_i)} \theta_{x \mid \boldsymbol{v}}^{\ \sum_{n=1}^{N} I\left(X_i^{(n)} = x,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}$

SLIDE 36

Local decomposition of the likelihood: Table CPDs

β€’ Define the counts $m_i(k, \boldsymbol{v}) = \sum_{n=1}^{N} I\!\left(X_i^{(n)} = k,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)$, with the constraint $\forall \boldsymbol{v}: \sum_{k} \theta^{i}_{k \mid \boldsymbol{v}} = 1$
β€’ MLE then gives, for each parent assignment $\boldsymbol{v}$:

$\hat{\theta}^{i}_{k \mid \boldsymbol{v}} = \dfrac{m_i(k, \boldsymbol{v})}{\sum_{k'} m_i(k', \boldsymbol{v})} = \dfrac{\sum_{n=1}^{N} I\!\left(X_i^{(n)} = k,\ \mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}{\sum_{n=1}^{N} I\!\left(\mathrm{Pa}_{X_i}^{(n)} = \boldsymbol{v}\right)}$

β€’ Table CPD $\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$: one row $\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} = (\theta^{i}_{1 \mid \boldsymbol{v}}, \dots, \theta^{i}_{K \mid \boldsymbol{v}})$ per parent assignment $\boldsymbol{v} \in \{\boldsymbol{v}_1, \dots, \boldsymbol{v}_M\}$ and one column per value $X_i = 1, \dots, K$; each row is a multinomial distribution

SLIDE 37

Overfitting

β€’ For a large number of parent configurations $|\mathrm{Val}(\mathrm{Pa}_{X_i})| = \prod_{A \in \mathrm{Pa}_{X_i}} |\mathrm{Val}(A)|$, most "buckets" will have very few instances
  β€’ The number of parameters increases exponentially with $|\mathrm{Pa}_{X_i}|$, giving poor estimates
  β€’ With limited data, overfitting occurs
β€’ We often get better generalization with simpler structures
β€’ Zero-count problem (the sparse-data problem)
  β€’ e.g., consider a naΓ―ve Bayes classifier and an email $e$ containing a word that does not occur in any training document for class $c$; then $P(e \mid c) = 0$

SLIDE 38

ML example: NaΓ―ve Bayes classifier

β€’ Plate model: class $Y^{(n)}$ with prior parameters $\boldsymbol{\pi}$, and features $X_j^{(n)}$, $j = 1, \dots, D$, with parameters $\boldsymbol{\theta}_j$, for $n = 1, \dots, N$; equivalently, $Y \rightarrow X_1, \dots, X_D$ with CPD parameters $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_D$ (the training and test phases share $\boldsymbol{\pi}$ and the $\boldsymbol{\theta}_j$)
β€’ $\boldsymbol{\theta}_j = \boldsymbol{\theta}_{X_j \mid Y} = \left(\boldsymbol{\theta}_{X_j \mid 1}, \dots, \boldsymbol{\theta}_{X_j \mid C}\right)$, where $Y \in \{1, \dots, C\}$

SLIDE 39

ML example: NaΓ―ve Bayes classifier

$L(\boldsymbol{\theta}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(Y^{(n)} \mid \boldsymbol{\pi}\right) \prod_{j=1}^{D} P\!\left(X_j^{(n)} \mid Y^{(n)}, \boldsymbol{\theta}_{X_j \mid Y}\right)$

β€’ Class prior: $\hat{\pi}_c = \dfrac{\sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)}{N}$

β€’ Discrete inputs or features ($X_j \in \{1, \dots, K\}$), as an estimate of $P(X_j = k \mid Y = c)$:

$\hat{\theta}^{j}_{k \mid c} = \dfrac{\sum_{n=1}^{N} I\!\left(X_j^{(n)} = k,\ Y^{(n)} = c\right)}{\sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)}$

SLIDE 40

Recall: Maximum conditional likelihood

β€’ Example: a discriminative classifier
  β€’ It needs to learn the conditional distribution (not the joint distribution)
β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$ and a parametric conditional distribution $P(y \mid \boldsymbol{x}; \boldsymbol{\theta})$, we find:

$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}; \boldsymbol{\theta}\right)$

β€’ We will also see maximum conditional likelihood for more general CPDs than tabular ones

SLIDE 41

Maximum conditional likelihood: Linear regression

β€’ Model: $P(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathcal{N}\!\left(y \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}),\ \sigma^2\right)$; linear case: $\boldsymbol{\phi}(\boldsymbol{x}) = [1, x_1, \dots, x_D]$
β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$, the conditional likelihood is:

$L_{Y \mid \boldsymbol{X}}(\boldsymbol{w}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}, \boldsymbol{w}\right) = \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}),\ \sigma^2\right)$

$\hat{\boldsymbol{w}} = \underset{\boldsymbol{w}}{\operatorname{argmax}} \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}),\ \sigma^2\right) = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{y}$

SLIDE 42

Maximum conditional likelihood: Logistic regression

β€’ Model: $P(y = 1 \mid \boldsymbol{x}) = \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x})\right)$, with $y \in \{0, 1\}$ and $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$; linear LR: $\boldsymbol{\phi}(\boldsymbol{x}) = [1, x_1, \dots, x_D]$

$P(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}\!\left(y \mid \sigma(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}))\right)$

β€’ Given $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$, the conditional likelihood is:

$L_{Y \mid \boldsymbol{X}}(\boldsymbol{w}; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(y^{(n)} \mid \boldsymbol{x}^{(n)}, \boldsymbol{w}\right) = \prod_{n=1}^{N} \mathrm{Ber}\!\left(y^{(n)} \mid \sigma(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)}))\right) = \prod_{n=1}^{N} \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)})\right)^{y^{(n)}} \left(1 - \sigma\!\left(\boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}^{(n)})\right)\right)^{1 - y^{(n)}}$

SLIDE 43

MLE for Bayesian networks: Summary

β€’ For a BN with disjoint sets of parameters in the CPDs, the likelihood decomposes as a product of local likelihood functions, one per node
  β€’ Thus, we can optimize the likelihoods of different nodes separately
β€’ For table CPDs, the local likelihood further decomposes as a product of likelihoods of multinomials, one for each parent combination
β€’ MLE suffers from the sparse-data problem

SLIDE 44

Bayesian learning for BNs: Global parameter independence

β€’ Global parameter independence assumption ($\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$):

$P(\boldsymbol{\theta}) = \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i)$

β€’ With complete data, the likelihood decomposes as $P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})$ with $L_i(\boldsymbol{\theta}_i; \mathcal{D}) = \prod_{n=1}^{N} P\!\left(X_i^{(n)} \mid \mathrm{Pa}_{X_i}^{(n)}, \boldsymbol{\theta}_i\right)$, so:

$P(\boldsymbol{\theta} \mid \mathcal{D}) = \dfrac{1}{P(\mathcal{D})} P(\mathcal{D} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta}) = \dfrac{1}{P(\mathcal{D})} \prod_{i \in \mathcal{V}} L_i(\boldsymbol{\theta}_i; \mathcal{D})\, P(\boldsymbol{\theta}_i) = \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i \mid \mathcal{D})$

β€’ $\Rightarrow$ The posteriors of the $\boldsymbol{\theta}_i$ are independent given complete data and independent priors (in the meta-network, each $\boldsymbol{\theta}_i$ is the only extra parent of its $X_i^{(n)}$ nodes, $n = 1, \dots, N$)

SLIDE 45

Bayesian learning for BNs: Global parameter independence

β€’ How can we read this independence directly from the structure of the Bayesian network (the meta-network over data and parameter variables) when $P(\boldsymbol{\theta})$ satisfies global parameter independence?

SLIDE 46

Bayesian learning for BNs: Global parameter independence

β€’ Predictive distribution (instances are independent given the parameters for complete data; $\boldsymbol{\theta}_i \equiv \boldsymbol{\theta}_{X_i \mid \mathrm{Pa}_{X_i}}$):

$P(\boldsymbol{x} \mid \mathcal{D}) = \int P(\boldsymbol{x} \mid \boldsymbol{\theta}, \mathcal{D})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} = \int \prod_{i \in \mathcal{V}} P\!\left(x_i \mid pa_{X_i}, \boldsymbol{\theta}_i\right) \prod_{i \in \mathcal{V}} P(\boldsymbol{\theta}_i \mid \mathcal{D})\, d\boldsymbol{\theta} = \prod_{i \in \mathcal{V}} \int P\!\left(x_i \mid pa_{X_i}, \boldsymbol{\theta}_i\right) P(\boldsymbol{\theta}_i \mid \mathcal{D})\, d\boldsymbol{\theta}_i$

(using $\int\!\!\int f(x)\, g(y)\, dx\, dy = \int f(x)\, dx \int g(y)\, dy$)

β€’ So we can solve the prediction problem for each CPD independently and then combine the results

SLIDE 47

Local decomposition for table CPDs

β€’ A prior $P(\boldsymbol{\theta}_{X \mid \mathrm{Pa}_X})$ satisfies the local parameter independence assumption if:

$P(\boldsymbol{\theta}_{X \mid \mathrm{Pa}_X}) = \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_X)} P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}})$

β€’ The posteriors $P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}} \mid \mathcal{D})$ and $P(\boldsymbol{\theta}_{X \mid \boldsymbol{v}'} \mid \mathcal{D})$ are then independent despite the v-structure on $X$, because $\mathrm{Pa}_X$ acts like a multiplexer
β€’ In other words, the groups of parameters in different rows of a CPD are independent (in the meta-network, the parent $Y^{(n)} \in \{1, \dots, C\}$ selects which of $\boldsymbol{\theta}_{X \mid 1}, \dots, \boldsymbol{\theta}_{X \mid C}$ generates $X^{(n)}$, $n = 1, \dots, N$)

SLIDE 48

Global and local parameter independence

β€’ Let the data be complete and the CPDs be tabular. If the prior $P(\boldsymbol{\theta})$ satisfies global and local parameter independence, then:

$P(\boldsymbol{\theta} \mid \mathcal{D}) = \prod_{i \in \mathcal{V}}\ \prod_{\boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})} P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \mid \mathcal{D}\right)$

β€’ With Dirichlet priors on each row ($K = |\mathrm{Val}(X_i)|$, $\mathrm{Pa}_{X_i} = \boldsymbol{v}$):

$\boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \sim \mathrm{Dir}\!\left(\alpha^{i}_{1 \mid \boldsymbol{v}}, \dots, \alpha^{i}_{K \mid \boldsymbol{v}}\right) \;\Rightarrow\; \boldsymbol{\theta}_{X_i \mid \boldsymbol{v}} \mid \mathcal{D} \sim \mathrm{Dir}\!\left(\alpha^{i}_{1 \mid \boldsymbol{v}} + m_i(1, \boldsymbol{v}), \dots, \alpha^{i}_{K \mid \boldsymbol{v}} + m_i(K, \boldsymbol{v})\right)$

$P\!\left(X_i = k \mid \boldsymbol{v}, \mathcal{D}\right) = \dfrac{\alpha^{i}_{k \mid \boldsymbol{v}} + m_i(k, \boldsymbol{v})}{\sum_{k'=1}^{K} \left(\alpha^{i}_{k' \mid \boldsymbol{v}} + m_i(k', \boldsymbol{v})\right)}$
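Extending the earlier counting sketch, a minimal illustration of the Dirichlet-smoothed row estimates above (illustrative code; the uniform pseudo-count alpha = 1 is an assumption):

```python
from collections import Counter

def bayes_table_cpd(data, child, parents, K, alpha=1.0):
    """Predictive P(X_i = k | v, D) = (alpha + m(k, v)) / (K*alpha + m(v)) for each seen v."""
    joint, parent_counts = Counter(), Counter()
    for sample in data:
        v = tuple(sample[p] for p in parents)
        joint[(sample[child], v)] += 1                 # m(k, v)
        parent_counts[v] += 1                          # m(v)
    return {(k, v): (alpha + joint[(k, v)]) / (K * alpha + parent_counts[v])
            for v in parent_counts for k in range(K)}

data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 1}, {"X1": 0, "X2": 0}, {"X1": 1, "X2": 1}]
print(bayes_table_cpd(data, child="X2", parents=["X1"], K=2))
```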

SLIDE 49

Priors for Bayesian learning

β€’ The priors $\boldsymbol{\alpha}_{X_i \mid \mathrm{Pa}_{X_i}}$ must be defined: $\left\{\alpha_{x \mid \boldsymbol{v}} : x \in \mathrm{Val}(X_i),\ \boldsymbol{v} \in \mathrm{Val}(\mathrm{Pa}_{X_i})\right\}$
β€’ K2 prior: a fixed value for all hyperparameters, $\alpha_{x \mid \boldsymbol{v}} = 1$
β€’ Alternatively, we can store a single $\alpha$ (an equivalent sample size) and a prior distribution $P'$ on the variables in the network: $\boldsymbol{\alpha}_{X_i \mid \mathrm{Pa}_{X_i}} = \alpha\, P'(X_i, \mathrm{Pa}_{X_i})$
β€’ BDe prior: define $P'$ as a set of independent marginals over the $X_i$'s

SLIDE 50

NaΓ―ve Bayes example: Bayesian learning

β€’ Plate model (training phase): class $Y^{(n)}$ with parameters $\boldsymbol{\pi}$ and hyperparameters $\boldsymbol{\alpha}$; features $X_j^{(n)}$, $j = 1, \dots, D$, with parameters $\boldsymbol{\theta}_{X_j \mid c}$ and hyperparameters $\boldsymbol{\alpha}_{X_j \mid c}$, $c = 1, \dots, C$; samples $n = 1, \dots, N$
β€’ $\boldsymbol{\theta}_j = \boldsymbol{\theta}_{X_j \mid Y} = \left(\boldsymbol{\theta}_{X_j \mid 1}, \dots, \boldsymbol{\theta}_{X_j \mid C}\right)$, where $Y \in \{1, \dots, C\}$

SLIDE 51

NaΓ―ve Bayes example: Bayesian learning

β€’ Class variable: $Y \sim \mathrm{Mult}(\boldsymbol{\pi})$ (i.e., $P(Y = c \mid \boldsymbol{\pi}) = \pi_c$), with prior $P(\boldsymbol{\pi} \mid \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1, \dots, \alpha_C)$ and posterior $P(\boldsymbol{\pi} \mid \mathcal{D}, \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1 + m_1, \dots, \alpha_C + m_C)$:

$P(Y = c \mid \mathcal{D}) = \dfrac{\alpha_c + m_c}{\sum_{c'=1}^{C} \left(\alpha_{c'} + m_{c'}\right)}, \qquad m_c = \sum_{n=1}^{N} I\!\left(Y^{(n)} = c\right)$

β€’ Discrete inputs or features ($X_j \in \{1, \dots, K\}$):

$P(X_j = k \mid Y = c, \mathcal{D}) = \dfrac{\alpha^{j}_{k \mid c} + m_j(k, c)}{\sum_{k'=1}^{K} \left(\alpha^{j}_{k' \mid c} + m_j(k', c)\right)}, \qquad m_j(k, c) = \sum_{n=1}^{N} I\!\left(X_j^{(n)} = k,\ Y^{(n)} = c\right)$

SLIDE 52

Global & local independence

β€’ For nodes with no parents, the parameters define a single distribution
  β€’ Bayesian or ML learning can be used as in simple density estimation on a single variable
β€’ More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of parent values
  β€’ The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration
  β€’ Apply independent Bayesian or ML learning to each
slide-53
SLIDE 53

Shared parameters

54

 Sharing CPDs or sharing parameters within a single CPD:

𝑄 π‘Œπ‘— π‘„π‘π‘Œπ‘—, 𝜾 = 𝑄 π‘Œ

π‘˜ π‘„π‘π‘Œπ‘˜, 𝜾

𝑄 π‘Œπ‘— 𝒗1, πœ„π‘—

𝑑 = 𝑄 π‘Œ π‘˜ 𝒗2, πœ„π‘— 𝑑

 MLE: For networks with shared CPDs, sufficient statistics

accumulate over all uses of CPD

 Aggregate sufficient statistics: collect all the instances generated

from the same conditional distribution and combine sufficient statistics from them

slide-54
SLIDE 54

Markov chain

55

𝑄 π‘Œ1, … , π‘Œπ‘ˆ 𝜾 = 𝑄 π‘Œ1 𝝆

𝑒=2 π‘ˆ

𝑄(π‘Œπ‘’|π‘Œπ‘’βˆ’1, π‘©π‘’βˆ’1,𝑒)

 Shared Parameters:

𝑄 π‘Œ1, … , π‘Œπ‘ˆ 𝜾 = 𝑄 π‘Œ1 𝝆

𝑒=2 π‘ˆ

𝑄(π‘Œπ‘’|π‘Œπ‘’βˆ’1, 𝑩)

π‘Œ1 π‘Œ2 π‘Œπ‘ˆ π‘Œπ‘ˆβˆ’1 …

slide-55
SLIDE 55

Markov chain

56

 Initial state probability:

πœŒπ‘— = 𝑄 π‘Œ1 = 𝑗 , 1 ≀ 𝑗 ≀ 𝐿

 State transition probability:

π΅π‘˜π‘— = πœ„

π‘˜|𝑗 = 𝑄 π‘Œπ‘’+1 = 𝑗 π‘Œπ‘’ = π‘˜ ,

1 ≀ 𝑗, π‘˜ ≀ 𝐿

𝑗=1 𝐿

π΅π‘˜π‘— = 1

π‘Œ1 π‘Œ2 π‘Œπ‘ˆ π‘Œπ‘ˆβˆ’1 … 𝝆 𝑩 = πœΎπ‘‡β€²|𝑇

SLIDE 56

Markov chain: Parameter learning by MLE

$P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} P\!\left(X_1^{(n)} \mid \boldsymbol{\pi}\right) \prod_{t=2}^{T} P\!\left(X_t^{(n)} \mid X_{t-1}^{(n)}, \boldsymbol{A}\right)$

$\hat{A}_{kj} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k,\ X_t^{(n)} = j\right)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k\right)}, \qquad \hat{\pi}_j = \dfrac{\sum_{n=1}^{N} I\!\left(X_1^{(n)} = j\right)}{N}$

SLIDE 57

Markov chain: Bayesian learning

β€’ Assign a Dirichlet prior to each row of the transition matrix $\boldsymbol{A}$, where $A_{kj} = P(X_{t+1} = j \mid X_t = k)$:

$\boldsymbol{A}_{k,\cdot} \sim \mathrm{Dir}\!\left(\alpha_{k,1}, \dots, \alpha_{k,K}\right)$

$P\!\left(X_{t+1} = j \mid X_t = k,\ \mathcal{D},\ \boldsymbol{\alpha}_{k,\cdot}\right) = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k,\ X_t^{(n)} = j\right) + \alpha_{k,j}}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\!\left(X_{t-1}^{(n)} = k\right) + \sum_{j'=1}^{K} \alpha_{k,j'}}$
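A minimal sketch covering both slides 56 and 57: transition counts give the ML estimate, and adding Dirichlet pseudo-counts gives the Bayesian (smoothed) estimate (illustrative code; the uniform pseudo-count alpha = 1 is an assumption):

```python
import numpy as np

def markov_chain_estimates(sequences, K, alpha=1.0):
    """pi_j and A_kj by MLE, plus the Dirichlet-smoothed transition estimate."""
    pi_counts = np.zeros(K)
    trans = np.zeros((K, K))                       # counts of (X_{t-1}=k, X_t=j)
    for seq in sequences:
        pi_counts[seq[0]] += 1
        for prev, cur in zip(seq[:-1], seq[1:]):
            trans[prev, cur] += 1
    pi_ml = pi_counts / pi_counts.sum()
    A_ml = trans / trans.sum(axis=1, keepdims=True)
    A_bayes = (trans + alpha) / (trans + alpha).sum(axis=1, keepdims=True)
    return pi_ml, A_ml, A_bayes

seqs = [[0, 1, 1, 2, 0], [0, 0, 1, 2, 2]]
pi_ml, A_ml, A_bayes = markov_chain_estimates(seqs, K=3)
print(A_ml[0], A_bayes[0])
```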

SLIDE 58

Hyper-parameters in the Bayesian approach

β€’ So far we have considered either parameter independencies or parameter sharing
β€’ Hierarchical priors give us a flexible language for introducing dependencies in the priors over parameters
  β€’ Useful when we have a small number of examples relevant to each parameter, but we believe that some parameters are reasonably similar
  β€’ In such situations, hierarchical priors "spread" the effect of the observations between parameters with shared hyper-parameters
β€’ However, when we use hyper-parameters we transform our learning problem into one that includes a hidden variable

SLIDE 59

Hyper-parameters: Example

β€’ [Koller & Friedman Book] The effect of the prior is to shift both $\theta_{a \mid x^0}$ and $\theta_{a \mid x^1}$ to be closer to each other

SLIDE 60

Hyper-parameters: Example

β€’ [Koller & Friedman Book] This is useful if we believe that two variables Y and Z depend on X in a similar (but not identical) way

SLIDE 61

References

β€’ D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009. Chapter 17.