

SLIDE 1

15-388/688 - Practical Data Science: Maximum likelihood estimation, naïve Bayes

J. Zico Kolter

Carnegie Mellon University, Spring 2018


SLIDE 2

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 3

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 4

Estimating the parameters of distributions

• We’re moving now from probability to statistics
• The basic question: given some data $x^{(1)}, \ldots, x^{(m)}$, how do I find a distribution that captures this data “well”?
• In general (if we can pick from the space of all distributions) this is a hard question, but if we pick from a particular parameterized family of distributions $p(X; \theta)$, the question is (at least a little bit) easier
• The question becomes: how do I find parameters $\theta$ of this distribution that fit the data?


SLIDE 5

Maximum likelihood estimation

Given a distribution $p(X; \theta)$ and a collection of observed (independent) data points $x^{(1)}, \ldots, x^{(m)}$, the probability of observing this data is simply

$$p(x^{(1)}, \ldots, x^{(m)}; \theta) = \prod_{i=1}^m p(x^{(i)}; \theta)$$

Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data

$$\underset{\theta}{\text{maximize}} \; \prod_{i=1}^m p(x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{maximize}} \; \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)$$

where $\ell(\theta)$ is called the log likelihood of the data

This seems “obvious”, but there are many other ways of fitting parameters

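As a minimal numerical sketch of this idea (not from the slides; the data is made up for illustration), we can evaluate $\ell(\theta)$ for Bernoulli data over a grid of candidate parameters and keep the maximizer:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical observed coin flips

def log_likelihood(theta, x):
    # Bernoulli: log p(x; theta) = x*log(theta) + (1 - x)*log(1 - theta)
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
best = max(grid, key=lambda t: log_likelihood(t, x))
print(best)  # close to the sample mean, 6/8 = 0.75
```

In one dimension a grid search is enough to see the idea; the following slides derive the maximizer in closed form.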

SLIDE 6

Parameter estimation for Bernoulli

Simple example: the Bernoulli distribution, $p(X = 1; \phi) = \phi$, $p(X = 0; \phi) = 1 - \phi$

Given observed data $x^{(1)}, \ldots, x^{(m)}$, the “obvious” answer is

$$\hat{\phi} = \frac{\#1\text{'s}}{\#\text{Total}} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

But why is this the case? Maybe there are other estimates that are just as good, e.g.

$$\hat{\phi} = \frac{\sum_{i=1}^m x^{(i)} + 1}{m + 2}$$


SLIDE 7

MLE for Bernoulli

The maximum likelihood solution for the Bernoulli is given by

$$\underset{\phi}{\text{maximize}} \; \prod_{i=1}^m p(x^{(i)}; \phi) = \underset{\phi}{\text{maximize}} \; \prod_{i=1}^m \phi^{x^{(i)}} (1 - \phi)^{1 - x^{(i)}}$$

Taking the log of the optimization objective (the log is monotonic, so this does not change the solution):

$$\underset{\phi}{\text{maximize}} \; \ell(\phi) = \sum_{i=1}^m \left( x^{(i)} \log \phi + (1 - x^{(i)}) \log (1 - \phi) \right)$$

The derivative with respect to $\phi$ is given by

$$\frac{d}{d\phi} \ell(\phi) = \sum_{i=1}^m \left( \frac{x^{(i)}}{\phi} - \frac{1 - x^{(i)}}{1 - \phi} \right) = \frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi}$$


SLIDE 8

MLE for Bernoulli, continued

Setting the derivative to zero gives:

$$\frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi} \equiv \frac{a}{\phi} - \frac{b}{1 - \phi} = 0 \;\Longrightarrow\; (1 - \phi)\,a = \phi\,b \;\Longrightarrow\; \phi = \frac{a}{a + b} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

So we have shown that the “natural” estimate of $\phi$ actually corresponds to the maximum likelihood estimate

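A quick numerical check of this derivation (a sketch on synthetic data, not part of the slides): the fraction of 1's attains a log likelihood at least as high as any other candidate $\phi$, including the smoothed estimate from slide 6.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)        # synthetic flips, true phi = 0.3

phi_mle = x.mean()                        # fraction of 1's
phi_alt = (x.sum() + 1) / (len(x) + 2)    # alternative estimate from slide 6

def ll(phi):
    # Bernoulli log likelihood of the observed data
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# neither any grid candidate nor the smoothed estimate beats the MLE
assert all(ll(phi_mle) >= ll(p) for p in np.linspace(0.01, 0.99, 99))
assert ll(phi_mle) >= ll(phi_alt)
print(phi_mle, phi_alt)
```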

SLIDE 9

Poll: Bernoulli maximum likelihood

Suppose we observe binary data $x^{(1)}, \ldots, x^{(m)}$ with $x^{(i)} \in \{0, 1\}$, with some $x^{(i)} = 0$ and some $x^{(j)} = 1$, and we compute the Bernoulli MLE

$$\hat{\phi} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

Which of the following statements is necessarily true? (may be more than one)

1. For any $\phi' \neq \hat{\phi}$, $p(x^{(i)}; \phi') \leq p(x^{(i)}; \hat{\phi})$ for all $i = 1, \ldots, m$
2. For any $\phi' \neq \hat{\phi}$, $\prod_{i=1}^m p(x^{(i)}; \phi') \leq \prod_{i=1}^m p(x^{(i)}; \hat{\phi})$
3. We always have $p(x^{(i)}; \phi') \geq p(x^{(i)}; \hat{\phi})$ for at least one $i$


SLIDE 10

MLE for Gaussian, briefly

For the Gaussian distribution

$$p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

the log likelihood is given by:

$$\ell(\mu, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{\sigma^2}$$

Derivatives (see if you can derive these fully):

$$\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \sum_{i=1}^m \frac{x^{(i)} - \mu}{\sigma^2} = 0 \;\Longrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$$

$$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{m}{2\sigma^2} + \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{(\sigma^2)^2} = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$$

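In code, these closed-form estimates are one line each (a sketch on synthetic data); note the variance uses the $1/m$ normalizer, not the $1/(m-1)$ sample-variance correction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data

mu_hat = x.mean()                      # MLE: (1/m) * sum_i x^(i)
sigma2_hat = np.mean((x - mu_hat)**2)  # MLE: 1/m normalizer, not 1/(m-1)
print(mu_hat, sigma2_hat)              # close to 2.0 and 1.5**2 = 2.25
```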

SLIDE 11

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 12

Naive Bayes modeling

• Naive Bayes is a machine learning algorithm that relies heavily on probabilistic modeling
• But it can also be interpreted in terms of the three ingredients of a machine learning algorithm (hypothesis function, loss, optimization); more on this later
• The basic idea is that we model the input and output as random variables $X = (X_1, X_2, \ldots, X_n)$ (several Bernoulli, categorical, or Gaussian random variables) and $Y$ (one Bernoulli or categorical random variable); the goal is to find $p(Y \mid X)$


SLIDE 13

Naive Bayes assumptions

We’re going to find $p(Y \mid X)$ via Bayes’ rule

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_y p(X \mid y)\, p(y)}$$

The denominator is just the sum, over all values of $Y$, of the distribution specified by the numerator, so we’re just going to focus on the $p(X \mid Y)\, p(Y)$ term

Modeling the full distribution $p(X \mid Y)$ for high-dimensional $X$ is not practical, so we’re going to make the naive Bayes assumption: the elements $X_i$ are conditionally independent given $Y$

$$p(X \mid Y) = \prod_{i=1}^n p(X_i \mid Y)$$


SLIDE 14

Modeling individual distributions

We’re going to explicitly model the distribution of each $p(X_i \mid Y)$ as well as $p(Y)$

We do this by specifying a distribution for $p(Y)$ and a separate distribution for each $p(X_i \mid Y = y)$

Assuming, for instance, that $Y$ and the $X_i$ are binary (Bernoulli random variables), we would represent the distributions

$$p(Y; \phi_0), \quad p(X_i \mid Y = 0; \phi_i^0), \quad p(X_i \mid Y = 1; \phi_i^1)$$

We then estimate the parameters of these distributions using MLE, i.e.

$$\phi_0 = \frac{\sum_{j=1}^m y^{(j)}}{m}, \qquad \phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$

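These MLE formulas amount to per-class counting. A minimal sketch (synthetic binary data; the variable names are mine, not the slides’):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
y = rng.binomial(1, 0.4, size=m)      # labels y^(j)
# features more likely to be 1 when y = 1 (purely synthetic)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.7, 0.2), size=(m, n))

phi0 = y.mean()                       # estimate of p(Y = 1)
# phi[c, i] estimates p(X_i = 1 | Y = c): fraction of 1's within each class
phi = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
print(phi0)
print(phi)
```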

SLIDE 15

Making predictions

Given some new data point $x$, we can now compute the probability of each class

$$p(Y = y \mid x) \propto p(Y = y) \prod_{i=1}^n p(x_i \mid Y = y) = \phi_0^y (1 - \phi_0)^{1-y} \prod_{i=1}^n (\phi_i^y)^{x_i} (1 - \phi_i^y)^{1 - x_i}$$

After you have computed the right-hand side, just normalize (divide by the sum over all $y$) to get the desired probability

Alternatively, if you just want to know the most likely $Y$, just compute each right-hand side and take the maximum

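A sketch of this prediction step, with hypothetical hand-set parameters for a three-feature model: compute each unnormalized right-hand side, then divide by their sum.

```python
import numpy as np

# hypothetical fitted parameters for a 3-feature binary naive Bayes model
phi0 = 0.4                               # p(Y = 1)
phi = np.array([[0.2, 0.1, 0.5],         # phi[0, i] = p(X_i = 1 | Y = 0)
                [0.7, 0.8, 0.5]])        # phi[1, i] = p(X_i = 1 | Y = 1)

def predict_proba(x):
    # p(Y = y | x) ∝ p(Y = y) * prod_i p(x_i | Y = y), then normalize over y
    prior = np.array([1 - phi0, phi0])
    like = np.prod(phi**x * (1 - phi)**(1 - x), axis=1)
    unnorm = prior * like
    return unnorm / unnorm.sum()

print(predict_proba(np.array([1, 0, 1])))   # [p(Y=0 | x), p(Y=1 | x)]
```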

SLIDE 16

Example


𝒁 π’€νŸ π’€νŸ 1 1 1 1 1 1 1 1 1 1 1 ? 1 π‘ž 𝑍 = 1 = 𝜚0 = π‘ž π‘Œ1 = 1 𝑍 = 0 = 𝜚1

0 =

π‘ž π‘Œ1 = 1 𝑍 = 1 = 𝜚1

1 =

π‘ž π‘Œ2 = 1 𝑍 = 0 = 𝜚2

0 =

π‘ž π‘Œ2 = 1 𝑍 = 0 = 𝜚2

1 =

π‘ž 𝑍 π‘Œ1 = 1, π‘Œ2 = 0 =

SLIDE 17

Potential issues

Problem #1: when computing the probability, the product

$$p(y) \prod_{i=1}^n p(x_i \mid y)$$

quickly goes to zero, to numerical precision

Solution: compute the log of the probabilities instead

$$\log p(y) + \sum_{i=1}^n \log p(x_i \mid y)$$

Problem #2: if we have never seen either $X_i = 1$ or $X_i = 0$ for a given $y$, then the corresponding probabilities computed by MLE will be zero

Solution: Laplace smoothing; “hallucinate” one $X_i = 0$ and one $X_i = 1$ observation for each class

$$\phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\} + 1}{\sum_{j=1}^m 1\{y^{(j)} = y\} + 2}$$

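Both fixes together in a short sketch (synthetic data): Laplace-smoothed counts for the parameters, and sums of logs instead of products for the class scores.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
y = rng.binomial(1, 0.5, size=m)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.8, 0.3), size=(m, n))

# Laplace smoothing: +1 in the numerator, +2 in the denominator, so no
# conditional probability is ever exactly 0 or 1
phi = np.stack([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                for c in (0, 1)])
prior = np.array([(y == 0).mean(), (y == 1).mean()])

def log_scores(x):
    # log p(y) + sum_i log p(x_i | y): the raw product would underflow
    # to zero once n is large
    return np.log(prior) + (x * np.log(phi) + (1 - x) * np.log(1 - phi)).sum(axis=1)

print(log_scores(X[0]), "-> predicted class:", np.argmax(log_scores(X[0])))
```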

SLIDE 18

Other distributions

Though naive Bayes is often presented as “just” counting, the value of the maximum likelihood interpretation is that it’s clear how to model $p(X_i \mid Y)$ for non-categorical random variables

Example: if $x_i$ is real-valued, we can model $p(X_i \mid Y = y)$ as a Gaussian

$$p(x_i \mid y; \mu_{i,y}, \sigma_{i,y}^2) = \mathcal{N}(x_i; \mu_{i,y}, \sigma_{i,y}^2)$$

with maximum likelihood estimates

$$\mu_{i,y} = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}, \qquad \sigma_{i,y}^2 = \frac{\sum_{j=1}^m (x_i^{(j)} - \mu_{i,y})^2 \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$

All probability computations are exactly the same as before (it doesn’t matter that some of the terms are probability densities)

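A sketch of this Gaussian variant (synthetic data): per-class, per-feature means and variances replace the Bernoulli counts, and the Gaussian log density drops into the same log-score computation as before.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 300, 2
y = rng.binomial(1, 0.5, size=m)
X = rng.normal(loc=np.where(y[:, None] == 1, 2.0, 0.0), size=(m, n))

# per-class, per-feature Gaussian MLE (np.var uses the 1/m normalizer)
mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
sigma2 = np.stack([X[y == c].var(axis=0) for c in (0, 1)])
prior = np.array([(y == 0).mean(), (y == 1).mean()])

def log_scores(x):
    # the Gaussian log density simply replaces log p(x_i | y); the rest of
    # the naive Bayes computation is unchanged
    logdens = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu)**2 / (2 * sigma2)
    return np.log(prior) + logdens.sum(axis=1)

print(np.argmax(log_scores(X[0])))
```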

SLIDE 19

Outline

• Maximum likelihood estimation
• Naive Bayes
• Machine learning and maximum likelihood


SLIDE 20

Machine learning via maximum likelihood

Many machine learning algorithms (specifically the loss function component) can be interpreted probabilistically, as maximum likelihood estimation

Recall logistic regression:

$$\underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \ell_{\text{logistic}}(h_\theta(x^{(i)}), y^{(i)}), \qquad \ell_{\text{logistic}}(h_\theta(x), y) = \log(1 + \exp(-y \cdot h_\theta(x)))$$


SLIDE 21

Logistic probability model

Consider the model (where $Y$ is binary, taking on $\{-1, +1\}$ values)

$$p(y \mid x; \theta) = \text{logistic}(y \cdot h_\theta(x)) = \frac{1}{1 + \exp(-y \cdot h_\theta(x))}$$

Under this model, the maximum likelihood estimate is

$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \ell_{\text{logistic}}(h_\theta(x^{(i)}), y^{(i)})$$

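The equivalence is easy to verify numerically (a sketch with random values standing in for $h_\theta(x^{(i)})$): the negative log likelihood under this model equals the summed logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=10)             # stand-ins for h_theta(x^(i))
y = rng.choice([-1, 1], size=10)    # labels in {-1, +1}

nll = -np.sum(np.log(1.0 / (1.0 + np.exp(-y * h))))   # -sum_i log p(y|x; theta)
logistic = np.sum(np.log(1.0 + np.exp(-y * h)))       # sum of logistic losses
assert np.isclose(nll, logistic)    # identical up to floating point
```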

SLIDE 22

Least squares

In linear regression, assume

$$y = \theta^T x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \;\Longleftrightarrow\; p(y \mid x; \theta) = \mathcal{N}(\theta^T x, \sigma^2)$$

Then the maximum likelihood estimate is given by

$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{minimize}} \; \sum_{i=1}^m \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

i.e., the least-squares loss function can be viewed as MLE under Gaussian errors

Other approaches are possible too: the absolute loss function can be viewed as MLE under Laplace errors

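And the same check for regression (a sketch on synthetic data): under the Gaussian-noise model, the MLE is exactly the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=m)   # y = theta^T x + eps

# minimizing sum_i (y^(i) - theta^T x^(i))^2 == MLE under Gaussian errors
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_hat)   # close to theta_true
```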