SLIDE 1 Logistic Regression
Machine Learning 10-701 Tom M. Mitchell Center for Automated Learning and Discovery Carnegie Mellon University September 29, 2005
Required reading:
- Mitchell draft chapter (see course website)
Recommended reading:
- Bishop, Chapter 3.1.3, 3.1.4
- Ng and Jordan paper (see course website)
SLIDE 2 Naïve Bayes: What you should know
- Designing classifiers based on Bayes rule
- Conditional independence
– What it is
– Why it’s important
- Naïve Bayes assumption and its consequences
– Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y))
- How to train Naïve Bayes classifiers
– MLE and MAP estimates
– With discrete and/or continuous inputs
SLIDE 3 Generative vs. Discriminative Classifiers
Wish to learn f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is the ‘generative’ model
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y | X = xi)
Discriminative classifiers:
- Assume some functional form for P(Y|X)
- This is the ‘discriminative’ model
- Estimate parameters of P(Y|X) directly from training data
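To make the contrast concrete, here is a small Python sketch (the toy data and all names are illustrative, not from the lecture): the generative route estimates P(Y) and P(X|Y) from counts and inverts them with Bayes rule, while the discriminative route would parameterize P(Y|X) directly, as the following slides develop.

```python
import numpy as np

# Toy data: 6 examples, 2 boolean features, boolean label (illustrative only).
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

# Generative (Naive Bayes) route: estimate P(Y) and P(Xi|Y) from counts,
# then apply Bayes rule to obtain P(Y=1 | X=x) for a query point x.
pi = y.mean()                       # estimate of P(Y=1)
theta1 = X[y == 1].mean(axis=0)     # estimates of P(Xi=1 | Y=1)
theta0 = X[y == 0].mean(axis=0)     # estimates of P(Xi=1 | Y=0)

x = np.array([1, 1])                # query point
lik1 = np.prod(theta1**x * (1 - theta1)**(1 - x))  # P(X=x|Y=1) under cond. indep.
lik0 = np.prod(theta0**x * (1 - theta0)**(1 - x))  # P(X=x|Y=0)
posterior = pi * lik1 / (pi * lik1 + (1 - pi) * lik0)  # Bayes rule
print(posterior)                    # P(Y=1 | X=x) = 0.8 on this toy data

# Discriminative (logistic regression) route: skip P(X|Y) entirely and posit
# P(Y=1|X) = 1 / (1 + exp(-(w0 + w.x))), estimating w directly (later slides).
```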
SLIDE 4
- Consider learning f: X → Y, where
- X is a vector of real-valued features, < X1 … Xn >
- Y is boolean
- We could use a Gaussian Naïve Bayes classifier
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(μik,σ)
- model P(Y) as Bernoulli (π)
- What does that imply about the form of P(Y|X)?
SLIDE 5
- Consider learning f: X → Y, where
- X is a vector of real-valued features, < X1 … Xn >
- Y is boolean
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(μik,σi)
- model P(Y) as Bernoulli (π)
- What does that imply about the form of P(Y|X)?
SLIDE 6
Very convenient!
P(Y=1 | X=<X1, ..., Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
implies P(Y=0 | X) = 1 / (1 + exp(w0 + Σi wi Xi))
implies P(Y=1 | X) / P(Y=0 | X) = exp(w0 + Σi wi Xi)
implies ln [ P(Y=1 | X) / P(Y=0 | X) ] = w0 + Σi wi Xi
→ linear classification rule: classify Y=1 exactly when w0 + Σi wi Xi > 0
SLIDE 7
Derive form for P(Y|X) for continuous Xi
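A sketch of the derivation, following the assumptions above (Bernoulli prior π, Gaussian class conditionals N(μik, σi) with variances independent of Y):

\begin{align*}
P(Y=1 \mid X) &= \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)} \\
&= \frac{1}{1 + \exp\!\left(-\Big(\ln\tfrac{\pi}{1-\pi} + \sum_i \ln\tfrac{P(X_i \mid Y=1)}{P(X_i \mid Y=0)}\Big)\right)}
\end{align*}

Substituting the Gaussians and expanding, the quadratic terms in $X_i$ cancel because $\sigma_i$ does not depend on $Y$:

\[
\sum_i \ln\frac{P(X_i \mid Y=1)}{P(X_i \mid Y=0)} = \sum_i \left( \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}\,X_i + \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2} \right)
\]

so $P(Y \mid X)$ takes exactly the logistic form:

\[
P(Y=1 \mid X) = \frac{1}{1+\exp\!\left(-(w_0 + \sum_i w_i X_i)\right)}, \qquad
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}, \qquad
w_0 = \ln\frac{\pi}{1-\pi} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}.
\]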
SLIDE 8
Very convenient!
P(Y=1 | X=<X1, ..., Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
implies P(Y=0 | X) = 1 / (1 + exp(w0 + Σi wi Xi))
implies P(Y=1 | X) / P(Y=0 | X) = exp(w0 + Σi wi Xi)
implies ln [ P(Y=1 | X) / P(Y=0 | X) ] = w0 + Σi wi Xi
→ linear classification rule: classify Y=1 exactly when w0 + Σi wi Xi > 0
SLIDE 9
Logistic function: σ(z) = 1 / (1 + exp(−z)), which maps any real z into (0, 1), with σ(0) = 0.5
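A small Python sketch of the logistic function (the weight values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

# P(Y=1|X) = sigmoid(w0 + w.x), so predicting Y=1 when the probability
# exceeds 0.5 is exactly the linear rule w0 + w.x > 0 from the last slide.
w0, w = -1.0, np.array([2.0, -0.5])   # hypothetical weights
x = np.array([1.0, 1.0])
p1 = sigmoid(w0 + w @ x)              # sigmoid(0.5), about 0.62
print(p1, p1 > 0.5)                   # predict Y=1, since w0 + w.x = 0.5 > 0
```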
SLIDE 10 Logistic regression more generally
- Logistic regression in the more general case, where Y ∈ {Y1, ..., YR}: learn R−1 sets of weights
for k < R:
P(Y=yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σj<R exp(wj0 + Σi wji Xi))
for k = R:
P(Y=yR | X) = 1 / (1 + Σj<R exp(wj0 + Σi wji Xi))
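A sketch of this rule in Python (the weight values are arbitrary placeholders, and `multiclass_probs` is an illustrative name):

```python
import numpy as np

def multiclass_probs(W, b, x):
    """P(Y=y_k | X=x) for k = 1..R using R-1 weight sets.

    W: (R-1, n) weight matrix, b: (R-1,) intercepts. Class R is the
    reference class, whose unnormalized score is fixed at exp(0) = 1.
    """
    scores = np.exp(b + W @ x)                      # exp(w_k0 + sum_i w_ki x_i), k < R
    denom = 1.0 + scores.sum()                      # 1 + sum over the R-1 classes
    return np.append(scores / denom, 1.0 / denom)   # classes k < R, then k = R

W = np.array([[0.5, -1.0], [1.0, 0.2]])   # hypothetical weights, R = 3 classes
b = np.array([0.1, -0.3])
p = multiclass_probs(W, b, np.array([1.0, 2.0]))
print(p, p.sum())                          # a proper distribution: sums to 1
```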
SLIDE 11 Training Logistic Regression: MCLE
- Choose parameters W = <w0, ..., wn> to maximize conditional likelihood of training data
- Training data D = {<X^1, Y^1>, ..., <X^L, Y^L>}
- Data likelihood = Πl P(X^l, Y^l | W)
- Data conditional likelihood = Πl P(Y^l | X^l, W)
where the conditional likelihood treats the X^l as given and models only the Y^l, so we maximize
l(W) ≡ ln Πl P(Y^l | X^l, W) = Σl ln P(Y^l | X^l, W)
SLIDE 12
Expressing Conditional Log Likelihood
With P(Y=1 | X, W) = 1 / (1 + exp(−(w0 + Σi wi Xi))):
l(W) = Σl Y^l ln P(Y^l=1 | X^l, W) + (1−Y^l) ln P(Y^l=0 | X^l, W)
     = Σl Y^l ln [ P(Y^l=1 | X^l, W) / P(Y^l=0 | X^l, W) ] + ln P(Y^l=0 | X^l, W)
     = Σl Y^l (w0 + Σi wi Xi^l) − ln(1 + exp(w0 + Σi wi Xi^l))
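As a quick numeric check, a Python sketch (illustrative names and data) evaluating l(W) both from the definition and from the simplified last line; the two agree, since the algebra above only rewrites the definition:

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, y):
    """Conditional log likelihood l(W), computed two equivalent ways."""
    z = w0 + X @ w                          # w0 + sum_i w_i X_i^l for each example l
    p1 = 1.0 / (1.0 + np.exp(-z))           # P(Y^l=1 | X^l, W)
    by_definition = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
    simplified = np.sum(y * z - np.log(1.0 + np.exp(z)))
    return by_definition, simplified

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # toy data
y = np.array([1, 0, 1])
print(conditional_log_likelihood(-0.5, np.array([1.0, -0.5]), X, y))
```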
SLIDE 13
Maximizing Conditional Log Likelihood
Good news: l(W) is a concave function of W, so gradient ascent cannot get trapped in a local maximum.
Bad news: there is no closed-form solution maximizing l(W), so we must optimize iteratively.
SLIDE 14 Maximizing Conditional Log Likelihood: Gradient
Differentiating l(W) from the previous slide (with the convention X0^l = 1, so w0 is handled uniformly):
∂l(W)/∂wi = Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W))
SLIDE 15
Maximize Conditional Log Likelihood: Gradient Ascent
Gradient ascent algorithm: iterate until change < ε
For all i, repeat:
wi ← wi + η Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W))
where η > 0 is a small step size.
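A runnable Python sketch of this batch update (the function name, toy data, and default η and ε are my own choices; the update step is the rule above):

```python
import numpy as np

def train_logistic_mcle(X, y, eta=0.01, eps=1e-6, max_iters=10_000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X: (L, n) real-valued features; y: (L,) labels in {0, 1}.
    A constant feature X0 = 1 is appended so w[0] plays the role of w0.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P̂(Y^l=1 | X^l, W) for every l
        step = eta * Xb.T @ (y - p1)           # η Σl Xi^l (Y^l − P̂(Y^l=1|X^l,W))
        w += step
        if np.max(np.abs(step)) < eps:         # iterate until change < ε
            break
    return w

# Usage on toy data: two overlapping Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(train_logistic_mcle(X, y))
```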
SLIDE 16 That’s all M(C)LE. How about MAP?
- One common approach is to define priors on W
– Normal distribution, zero mean, identity covariance
- Helps avoid very large weights and overfitting
- MAP estimate: W ← argmax_W ln [ P(W) Πl P(Y^l | X^l, W) ]
With the zero-mean Gaussian prior, this adds a −λ wi penalty term to each component of the gradient:
wi ← wi + η Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W)) − η λ wi
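A minimal variant of the MCLE sketch above for the MAP case (`lam` stands in for λ; the slide does not fix its value):

```python
import numpy as np

def train_logistic_map(X, y, lam=1.0, eta=0.01, eps=1e-6, max_iters=10_000):
    """Gradient ascent on ln P(W) + l(W), with a zero-mean Gaussian prior on W.

    The prior contributes -lam/2 * ||w||^2 to the objective, i.e. an extra
    -lam * w_i term in each gradient component; lam = 0 recovers MCLE.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        step = eta * (Xb.T @ (y - p1) - lam * w)   # MCLE gradient plus prior term
        w += step
        if np.max(np.abs(step)) < eps:
            break
    return w
```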
SLIDE 17 MLE vs MAP
- Maximum conditional likelihood estimate: W_MCLE = argmax_W Πl P(Y^l | X^l, W)
- Maximum a posteriori estimate (with a prior P(W), e.g. zero-mean Gaussian): W_MAP = argmax_W P(W) Πl P(Y^l | X^l, W)
SLIDE 18 Naïve Bayes vs. Logistic Regression
- Generative and Discriminative classifiers
- Asymptotic comparison (# training examples → ∞)
- when model correct
- when model incorrect
- Non-asymptotic analysis
- convergence rate of parameter estimates
- convergence rate of expected error
- Experimental results
[Ng & Jordan, 2002]
SLIDE 19 Naïve Bayes vs Logistic Regression
Consider Y and Xi boolean, X = <X1, ..., Xn>
Number of parameters:
- NB: 2n + 1 (π, plus P(Xi=1 | Y=0) and P(Xi=1 | Y=1) for each of the n features)
- LR: n + 1 (w0, w1, ..., wn)
Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
SLIDE 20 What is the difference asymptotically?
Notation: let ε(A, m) denote the error of the hypothesis learned via algorithm A from m examples
- If the assumed naïve Bayes model is correct, then the two converge to classifiers of equal accuracy: ε(NB, ∞) = ε(LR, ∞)
- If the assumed model is incorrect, logistic regression is asymptotically at least as accurate: ε(LR, ∞) ≤ ε(NB, ∞)
Note the assumed discriminative model can be correct even when the generative model is incorrect, but not vice versa (the naïve Bayes assumptions imply the logistic form of P(Y|X), not conversely)
SLIDE 21 Rate of convergence: logistic regression
Let h_Dis,m be logistic regression trained on m examples in n dimensions. Then with high probability:
ε(h_Dis,m) ≤ ε(h_Dis,∞) + O( sqrt( (n/m) log(m/n) ) )
Implication: if we want ε(h_Dis,m) ≤ ε(h_Dis,∞) + ε0 for some constant ε0, it suffices to pick m of order n examples
→ logistic regression converges to its asymptotic classifier with order n examples (the result follows from Vapnik’s structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n + 1)
SLIDE 22
Rate of convergence: naïve Bayes
Consider first how quickly parameter estimates converge toward their asymptotic values. Then we’ll ask how this influences rate of convergence toward asymptotic classification error.
SLIDE 23
Rate of convergence: naïve Bayes parameters
Ng & Jordan (2002) show that the naïve Bayes parameter estimates come within ε of their asymptotic values with only m = O(log n) examples, so naïve Bayes reaches its (possibly less accurate) asymptotic classifier after order log n examples, far fewer than logistic regression’s order n.
SLIDE 24
Some experiments from UCI data sets
SLIDE 25 What you should know:
- Logistic regression:
– Functional form follows from Naïve Bayes assumptions
– But training procedure picks parameters without the conditional independence assumption
– MLE training: pick W to maximize P(Y | X, W)
– MAP training: pick W to maximize P(W | X, Y); this is a form of ‘regularization’
- Gradient ascent/descent
– General approach when closed-form solutions unavailable
- Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff