

SLIDE 1

15-388/688 - Practical Data Science: Maximum likelihood estimation, naïve Bayes

J. Zico Kolter

Carnegie Mellon University, Spring 2018


SLIDE 2

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 3

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 4

Estimating the parameters of distributions

• We’re moving now from probability to statistics
• The basic question: given some data $x^{(1)}, \ldots, x^{(m)}$, how do I find a distribution that captures this data “well”?
• In general (if we can pick from the space of all distributions) this is a hard question, but if we pick from a particular parameterized family of distributions $p(X; \theta)$, the question is (at least a little bit) easier
• The question becomes: how do I find parameters $\theta$ of this distribution that fit the data?


SLIDE 5

Maximum likelihood estimation

Given a distribution $p(X; \theta)$ and a collection of observed (independent) data points $x^{(1)}, \ldots, x^{(m)}$, the probability of observing this data is simply

$$p(x^{(1)}, \ldots, x^{(m)}; \theta) = \prod_{i=1}^m p(x^{(i)}; \theta)$$

Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data

$$\underset{\theta}{\text{maximize}} \; \prod_{i=1}^m p(x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{maximize}} \; \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)$$

where $\ell(\theta)$ is called the log likelihood of the data

This seems “obvious”, but there are many other ways of fitting parameters

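As a minimal numerical sketch of this idea (not from the slides; the data is made up for illustration), we can evaluate $\ell(\theta)$ for Bernoulli data over a grid of candidate parameters and keep the maximizer:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical observed coin flips

def log_likelihood(theta, x):
    # Bernoulli: log p(x; theta) = x*log(theta) + (1 - x)*log(1 - theta)
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
best = max(grid, key=lambda t: log_likelihood(t, x))
print(best)  # close to the sample mean, 6/8 = 0.75
```

In one dimension a grid search is enough to see the idea; the following slides derive the maximizer in closed form.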

SLIDE 6

Parameter estimation for Bernoulli

Simple example: the Bernoulli distribution, $p(X = 1; \phi) = \phi$, $p(X = 0; \phi) = 1 - \phi$

Given observed data $x^{(1)}, \ldots, x^{(m)}$, the “obvious” answer is

$$\hat{\phi} = \frac{\#1\text{'s}}{\#\text{Total}} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

But why is this the case? Maybe there are other estimates that are just as good, e.g.

$$\hat{\phi} = \frac{\sum_{i=1}^m x^{(i)} + 1}{m + 2}$$


SLIDE 7

MLE for Bernoulli

The maximum likelihood solution for the Bernoulli is given by

$$\underset{\phi}{\text{maximize}} \; \prod_{i=1}^m p(x^{(i)}; \phi) = \underset{\phi}{\text{maximize}} \; \prod_{i=1}^m \phi^{x^{(i)}} (1 - \phi)^{1 - x^{(i)}}$$

Taking the log of the optimization objective (the log is monotonic, so this does not change the solution):

$$\underset{\phi}{\text{maximize}} \; \ell(\phi) = \sum_{i=1}^m \left( x^{(i)} \log \phi + (1 - x^{(i)}) \log (1 - \phi) \right)$$

The derivative with respect to $\phi$ is given by

$$\frac{d}{d\phi} \ell(\phi) = \sum_{i=1}^m \left( \frac{x^{(i)}}{\phi} - \frac{1 - x^{(i)}}{1 - \phi} \right) = \frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi}$$


SLIDE 8

MLE for Bernoulli, continued

Setting the derivative to zero gives:

$$\frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi} \equiv \frac{a}{\phi} - \frac{b}{1 - \phi} = 0 \;\Longrightarrow\; (1 - \phi)\,a = \phi\,b \;\Longrightarrow\; \phi = \frac{a}{a + b} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

So we have shown that the “natural” estimate of $\phi$ actually corresponds to the maximum likelihood estimate

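A quick numerical check of this derivation (a sketch on synthetic data, not part of the slides): the fraction of 1's attains a log likelihood at least as high as any other candidate $\phi$, including the smoothed estimate from slide 6.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)        # synthetic flips, true phi = 0.3

phi_mle = x.mean()                        # fraction of 1's
phi_alt = (x.sum() + 1) / (len(x) + 2)    # alternative estimate from slide 6

def ll(phi):
    # Bernoulli log likelihood of the observed data
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# neither any grid candidate nor the smoothed estimate beats the MLE
assert all(ll(phi_mle) >= ll(p) for p in np.linspace(0.01, 0.99, 99))
assert ll(phi_mle) >= ll(phi_alt)
print(phi_mle, phi_alt)
```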

SLIDE 9

Poll: Bernoulli maximum likelihood

Suppose we observe binary data $x^{(1)}, \ldots, x^{(m)}$ with $x^{(i)} \in \{0, 1\}$, with some $x^{(i)} = 0$ and some $x^{(j)} = 1$, and we compute the Bernoulli MLE

$$\hat{\phi} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

Which of the following statements is necessarily true? (may be more than one)

1. For any $\phi' \neq \hat{\phi}$, $p(x^{(i)}; \phi') \leq p(x^{(i)}; \hat{\phi})$ for all $i = 1, \ldots, m$
2. For any $\phi' \neq \hat{\phi}$, $\prod_{i=1}^m p(x^{(i)}; \phi') \leq \prod_{i=1}^m p(x^{(i)}; \hat{\phi})$
3. We always have $p(x^{(i)}; \phi') \geq p(x^{(i)}; \hat{\phi})$ for at least one $i$


SLIDE 10

MLE for Gaussian, briefly

For the Gaussian distribution

$$p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

the log likelihood is given by:

$$\ell(\mu, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{\sigma^2}$$

Derivatives (see if you can derive these fully):

$$\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \sum_{i=1}^m \frac{x^{(i)} - \mu}{\sigma^2} = 0 \;\Longrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$$

$$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{m}{2\sigma^2} + \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{(\sigma^2)^2} = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$$

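In code, these closed-form estimates are one line each (a sketch on synthetic data); note the variance uses the $1/m$ normalizer, not the $1/(m-1)$ sample-variance correction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data

mu_hat = x.mean()                      # MLE: (1/m) * sum_i x^(i)
sigma2_hat = np.mean((x - mu_hat)**2)  # MLE: 1/m normalizer, not 1/(m-1)
print(mu_hat, sigma2_hat)              # close to 2.0 and 1.5**2 = 2.25
```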

SLIDE 11

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 12

Naive Bayes modeling

• Naive Bayes is a machine learning algorithm that relies heavily on probabilistic modeling
• But it can also be interpreted in terms of the three ingredients of a machine learning algorithm (hypothesis function, loss, optimization); more on this later
• The basic idea is that we model the input and output as random variables $X = (X_1, X_2, \ldots, X_n)$ (several Bernoulli, categorical, or Gaussian random variables) and $Y$ (one Bernoulli or categorical random variable); the goal is to find $p(Y \mid X)$


SLIDE 13

Naive Bayes assumptions

We’re going to find $p(Y \mid X)$ via Bayes’ rule

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_y p(X \mid y)\, p(y)}$$

The denominator is just the sum, over all values of $Y$, of the distribution specified by the numerator, so we’re just going to focus on the $p(X \mid Y)\, p(Y)$ term

Modeling the full distribution $p(X \mid Y)$ for high-dimensional $X$ is not practical, so we’re going to make the naive Bayes assumption: the elements $X_i$ are conditionally independent given $Y$

$$p(X \mid Y) = \prod_{i=1}^n p(X_i \mid Y)$$


SLIDE 14

Modeling individual distributions

We’re going to explicitly model the distribution of each $p(X_i \mid Y)$ as well as $p(Y)$

We do this by specifying a distribution for $p(Y)$ and a separate distribution for each $p(X_i \mid Y = y)$

Assuming, for instance, that $Y$ and the $X_i$ are binary (Bernoulli random variables), we would represent the distributions

$$p(Y; \phi_0), \quad p(X_i \mid Y = 0; \phi_i^0), \quad p(X_i \mid Y = 1; \phi_i^1)$$

We then estimate the parameters of these distributions using MLE, i.e.

$$\phi_0 = \frac{\sum_{j=1}^m y^{(j)}}{m}, \qquad \phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$

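These MLE formulas amount to per-class counting. A minimal sketch (synthetic binary data; the variable names are mine, not the slides’):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
y = rng.binomial(1, 0.4, size=m)      # labels y^(j)
# features more likely to be 1 when y = 1 (purely synthetic)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.7, 0.2), size=(m, n))

phi0 = y.mean()                       # estimate of p(Y = 1)
# phi[c, i] estimates p(X_i = 1 | Y = c): fraction of 1's within each class
phi = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
print(phi0)
print(phi)
```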

SLIDE 15

Making predictions

Given some new data point $x$, we can now compute the probability of each class

$$p(Y = y \mid x) \propto p(Y = y) \prod_{i=1}^n p(x_i \mid Y = y) = \phi_0^y (1 - \phi_0)^{1-y} \prod_{i=1}^n (\phi_i^y)^{x_i} (1 - \phi_i^y)^{1 - x_i}$$

After you have computed the right-hand side, just normalize (divide by the sum over all $y$) to get the desired probability

Alternatively, if you just want to know the most likely $Y$, just compute each right-hand side and take the maximum

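A sketch of this prediction step, with hypothetical hand-set parameters for a three-feature model: compute each unnormalized right-hand side, then divide by their sum.

```python
import numpy as np

# hypothetical fitted parameters for a 3-feature binary naive Bayes model
phi0 = 0.4                               # p(Y = 1)
phi = np.array([[0.2, 0.1, 0.5],         # phi[0, i] = p(X_i = 1 | Y = 0)
                [0.7, 0.8, 0.5]])        # phi[1, i] = p(X_i = 1 | Y = 1)

def predict_proba(x):
    # p(Y = y | x) ∝ p(Y = y) * prod_i p(x_i | Y = y), then normalize over y
    prior = np.array([1 - phi0, phi0])
    like = np.prod(phi**x * (1 - phi)**(1 - x), axis=1)
    unnorm = prior * like
    return unnorm / unnorm.sum()

print(predict_proba(np.array([1, 0, 1])))   # [p(Y=0 | x), p(Y=1 | x)]
```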

SLIDE 16

Example


𝒁 π’€νŸ π’€νŸ 1 1 1 1 1 1 1 1 1 1 1 ? 1 π‘ž 𝑍 = 1 = 𝜚0 = π‘ž π‘Œ1 = 1 𝑍 = 0 = 𝜚1

0 =

π‘ž π‘Œ1 = 1 𝑍 = 1 = 𝜚1

1 =

π‘ž π‘Œ2 = 1 𝑍 = 0 = 𝜚2

0 =

π‘ž π‘Œ2 = 1 𝑍 = 0 = 𝜚2

1 =

π‘ž 𝑍 π‘Œ1 = 1, π‘Œ2 = 0 =

SLIDE 17

Potential issues

Problem #1: when computing the probability, the product

$$p(y) \prod_{i=1}^n p(x_i \mid y)$$

quickly goes to zero, to numerical precision

Solution: compute the log of the probabilities instead

$$\log p(y) + \sum_{i=1}^n \log p(x_i \mid y)$$

Problem #2: if we have never seen either $X_i = 1$ or $X_i = 0$ for a given $y$, then the corresponding probabilities computed by MLE will be zero

Solution: Laplace smoothing; “hallucinate” one $X_i = 0$ and one $X_i = 1$ observation for each class

$$\phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\} + 1}{\sum_{j=1}^m 1\{y^{(j)} = y\} + 2}$$

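Both fixes together in a short sketch (synthetic data): Laplace-smoothed counts for the parameters, and sums of logs instead of products for the class scores.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
y = rng.binomial(1, 0.5, size=m)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.8, 0.3), size=(m, n))

# Laplace smoothing: +1 in the numerator, +2 in the denominator, so no
# conditional probability is ever exactly 0 or 1
phi = np.stack([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                for c in (0, 1)])
prior = np.array([(y == 0).mean(), (y == 1).mean()])

def log_scores(x):
    # log p(y) + sum_i log p(x_i | y): the raw product would underflow
    # to zero once n is large
    return np.log(prior) + (x * np.log(phi) + (1 - x) * np.log(1 - phi)).sum(axis=1)

print(log_scores(X[0]), "-> predicted class:", np.argmax(log_scores(X[0])))
```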

SLIDE 18

Other distributions

Though naive Bayes is often presented as “just” counting, the value of the maximum likelihood interpretation is that it’s clear how to model $p(X_i \mid Y)$ for non-categorical random variables

Example: if $x_i$ is real-valued, we can model $p(X_i \mid Y = y)$ as a Gaussian

$$p(x_i \mid y; \mu_{i,y}, \sigma_{i,y}^2) = \mathcal{N}(x_i; \mu_{i,y}, \sigma_{i,y}^2)$$

with maximum likelihood estimates

$$\mu_{i,y} = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}, \qquad \sigma_{i,y}^2 = \frac{\sum_{j=1}^m (x_i^{(j)} - \mu_{i,y})^2 \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$

All probability computations are exactly the same as before (it doesn’t matter that some of the terms are probability densities)

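A sketch of this Gaussian variant (synthetic data): per-class, per-feature means and variances replace the Bernoulli counts, and the Gaussian log density drops into the same log-score computation as before.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 300, 2
y = rng.binomial(1, 0.5, size=m)
X = rng.normal(loc=np.where(y[:, None] == 1, 2.0, 0.0), size=(m, n))

# per-class, per-feature Gaussian MLE (np.var uses the 1/m normalizer)
mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
sigma2 = np.stack([X[y == c].var(axis=0) for c in (0, 1)])
prior = np.array([(y == 0).mean(), (y == 1).mean()])

def log_scores(x):
    # the Gaussian log density simply replaces log p(x_i | y); the rest of
    # the naive Bayes computation is unchanged
    logdens = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu)**2 / (2 * sigma2)
    return np.log(prior) + logdens.sum(axis=1)

print(np.argmax(log_scores(X[0])))
```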

SLIDE 19

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 20

Machine learning via maximum likelihood

Many machine learning algorithms (specifically the loss function component) can be interpreted probabilistically, as maximum likelihood estimation

Recall logistic regression:

$$\underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \ell_{\text{logistic}}(h_\theta(x^{(i)}), y^{(i)}), \qquad \ell_{\text{logistic}}(h_\theta(x), y) = \log(1 + \exp(-y \cdot h_\theta(x)))$$


SLIDE 21

Logistic probability model

Consider the model (where $Y$ is binary, taking on $\{-1, +1\}$ values)

$$p(y \mid x; \theta) = \text{logistic}(y \cdot h_\theta(x)) = \frac{1}{1 + \exp(-y \cdot h_\theta(x))}$$

Under this model, the maximum likelihood estimate is

$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \ell_{\text{logistic}}(h_\theta(x^{(i)}), y^{(i)})$$

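The equivalence is easy to verify numerically (a sketch with random values standing in for $h_\theta(x^{(i)})$): the negative log likelihood under this model equals the summed logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=10)             # stand-ins for h_theta(x^(i))
y = rng.choice([-1, 1], size=10)    # labels in {-1, +1}

nll = -np.sum(np.log(1.0 / (1.0 + np.exp(-y * h))))   # -sum_i log p(y|x; theta)
logistic = np.sum(np.log(1.0 + np.exp(-y * h)))       # sum of logistic losses
assert np.isclose(nll, logistic)    # identical up to floating point
```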

SLIDE 22

Least squares

In linear regression, assume

$$y = \theta^T x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \;\Longleftrightarrow\; p(y \mid x; \theta) = \mathcal{N}(\theta^T x, \sigma^2)$$

Then the maximum likelihood estimate is given by

$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

i.e., the least-squares loss function can be viewed as MLE under Gaussian errors

Other approaches are possible too: the absolute loss function can be viewed as MLE under Laplace errors

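And the same check for regression (a sketch on synthetic data): under the Gaussian-noise model, the MLE is exactly the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=m)   # y = theta^T x + eps

# minimizing sum_i (y^(i) - theta^T x^(i))^2 == MLE under Gaussian errors
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_hat)   # close to theta_true
```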