SLIDE 1

Logistic Regression

Machine Learning 10-701
Tom M. Mitchell
Center for Automated Learning and Discovery
Carnegie Mellon University
September 29, 2005

Required reading:

  • Mitchell draft chapter (see course website)

Recommended reading:

  • Bishop, Chapter 3.1.3, 3.1.4
  • Ng and Jordan paper (see course website)
SLIDE 2

Naïve Bayes: What you should know

  • Designing classifiers based on Bayes rule
  • Conditional independence

– What it is
– Why it’s important

  • Naïve Bayes assumption and its consequences

– Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y))

  • How to train Naïve Bayes classifiers

– MLE and MAP estimates
– With discrete and/or continuous inputs

SLIDE 3

Generative vs. Discriminative Classifiers

Wish to learn f: X → Y, or P(Y|X)

Generative classifiers (e.g., Naïve Bayes):

  • Assume some functional form for P(X|Y), P(Y)
  • This is the ‘generative’ model
  • Estimate parameters of P(X|Y), P(Y) directly from training data
  • Use Bayes rule to calculate P(Y|X= xi)

Discriminative classifiers:

  • Assume some functional form for P(Y|X)
  • This is the ‘discriminative’ model
  • Estimate parameters of P(Y|X) directly from training data
SLIDE 4
  • Consider learning f: X → Y, where
  • X is a vector of real-valued features, < X1 … Xn >
  • Y is boolean
  • We could use a Gaussian Naïve Bayes classifier
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik,σ)
  • model P(Y) as Bernoulli (π)
  • What does that imply about the form of P(Y|X)?
SLIDE 5
  • Consider learning f: X → Y, where
  • X is a vector of real-valued features, < X1 … Xn >
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik,σi)
  • model P(Y) as Bernoulli (π)
  • What does that imply about the form of P(Y|X)?
SLIDE 6

Very convenient!

The Gaussian Naïve Bayes assumptions imply a logistic form for P(Y|X), which in turn implies a linear classification rule (see the sketch below).
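A sketch of where the chain of implications ends up, under the slide 4/5 assumptions (w0 and the wi are functions of the Gaussian parameters; the next slide derives them):

    P(Y=1 | X) = 1 / (1 + exp(w0 + Σ_i wi Xi))
    P(Y=0 | X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi))

so choosing the more probable label reduces to a linear test: predict Y = 0 exactly when w0 + Σ_i wi Xi > 0.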

SLIDE 7

Derive form for P(Y|X) for continuous Xi
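A sketch of the derivation under the stated assumptions (the Xi conditionally independent given Y, P(Xi | Y = yk) Gaussian N(μik, σi) with the variance not depending on the class, and P(Y=1) = π):

    P(Y=1 | X) = P(Y=1) P(X | Y=1) / [ P(Y=1) P(X | Y=1) + P(Y=0) P(X | Y=0) ]
               = 1 / ( 1 + exp( ln((1-π)/π) + Σ_i ln( P(Xi | Y=0) / P(Xi | Y=1) ) ) )

Substituting the Gaussians, each log ratio is linear in Xi:

    ln( P(Xi | Y=0) / P(Xi | Y=1) ) = ( (μi0 - μi1) / σi² ) Xi + ( μi1² - μi0² ) / ( 2 σi² )

so P(Y=1 | X) = 1 / (1 + exp(w0 + Σ_i wi Xi)), with wi = (μi0 - μi1) / σi² and w0 = ln((1-π)/π) + Σ_i (μi1² - μi0²) / (2 σi²). Flipping the sign of every weight gives the equivalent form 1 / (1 + exp(-(w0 + Σ_i wi Xi))) used on the later training slides.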

SLIDE 8

Very convenient!

As above: the assumptions imply a logistic form for P(Y|X), and hence a linear classification rule.

SLIDE 9

Logistic function
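For reference, the logistic function and two properties used repeatedly below:

    logistic(z) = 1 / (1 + exp(-z))

  • its output lies strictly between 0 and 1, so it can be read as a probability
  • its derivative is logistic(z) (1 - logistic(z)), which keeps the gradient formulas on the later slides simple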

SLIDE 10

Logistic regression more generally

  • Logistic regression in the more general case, where Y ∈ {Y1 ... YR}: learn R-1 sets of weights, with one expression for P(Y=yk|X) when k < R and another when k = R (see below)
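One common way to write the two cases (wk0 and wki denote the k-th set of weights; class yR needs no weights of its own):

    for k < R:   P(Y = yk | X) = exp( wk0 + Σ_i wki Xi ) / ( 1 + Σ_{j<R} exp( wj0 + Σ_i wji Xi ) )
    for k = R:   P(Y = yR | X) = 1 / ( 1 + Σ_{j<R} exp( wj0 + Σ_i wji Xi ) )

These R expressions sum to 1, so the R-th class is determined by the other R-1 sets of weights.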

SLIDE 11

Training Logistic Regression: MCLE

  • Choose parameters W = <w0, ... wn> to maximize the conditional likelihood of the training data

  • Training data D = { <X^1, Y^1>, ... , <X^L, Y^L> }
  • Data likelihood = Π_l P(X^l, Y^l | W)
  • Data conditional likelihood = Π_l P(Y^l | X^l, W)

where the superscript l indexes the L training examples, and the maximum conditional likelihood estimate is W_MCLE = arg max_W Π_l P(Y^l | X^l, W)

SLIDE 12

Expressing Conditional Log Likelihood
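Written out for boolean Y, using the convention P(Y=1 | X, W) = 1 / (1 + exp(-(w0 + Σ_i wi Xi))) and superscript l to index training examples (a standard manipulation, not specific to these slides):

    l(W) = Σ_l ln P(Y^l | X^l, W)
         = Σ_l [ Y^l ln P(Y^l=1 | X^l, W) + (1 - Y^l) ln P(Y^l=0 | X^l, W) ]
         = Σ_l [ Y^l (w0 + Σ_i wi Xi^l) - ln( 1 + exp(w0 + Σ_i wi Xi^l) ) ]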

SLIDE 13

Maximizing Conditional Log Likelihood

Good news: l(W) is a concave function of W
Bad news: no closed-form solution to maximize l(W)
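Its gradient, however, has a simple closed form (a standard result under the same convention as above, defining X0^l = 1 so the i = 0 case is covered; P̂ is the model's current estimate):

    ∂ l(W) / ∂ wi = Σ_l Xi^l ( Y^l - P̂(Y^l = 1 | X^l, W) )

This is what the gradient ascent rule on the following slides climbs.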

SLIDE 14
SLIDE 15

Maximize Conditional Log Likelihood: Gradient Ascent

Gradient ascent algorithm: iterate until change < ε

For all i, repeat:

    wi ← wi + η Σ_l Xi^l ( Y^l - P̂(Y^l = 1 | X^l, W) )
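A minimal runnable sketch of this loop (not from the slides; assumes numpy, 0/1 labels, a fixed learning rate eta, and the same P(Y=1 | X, W) = 1 / (1 + exp(-(w0 + Σ_i wi Xi))) convention as above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iters=10000):
        """Batch gradient ascent on the conditional log likelihood l(W).

        X: (m, n) array of real-valued features; y: (m,) array of 0/1 labels.
        Returns w of length n+1, where w[0] plays the role of the bias w0.
        """
        m, n = X.shape
        Xb = np.hstack([np.ones((m, 1)), X])      # prepend X0 = 1 so w[0] is w0
        w = np.zeros(n + 1)
        for _ in range(max_iters):
            p1 = sigmoid(Xb @ w)                  # P̂(Y=1 | X^l, W) for every example
            grad = Xb.T @ (y - p1)                # dl/dwi = Σ_l Xi^l (Y^l - P̂(...))
            w_new = w + eta * grad                # gradient ascent step
            if np.max(np.abs(w_new - w)) < eps:   # iterate until change < ε
                return w_new
            w = w_new
        return w

    # Hypothetical toy usage:
    # X = np.array([[0.5, 1.2], [1.5, 0.3], [3.0, 2.5], [2.2, 3.1]])
    # y = np.array([0, 0, 1, 1])
    # w = train_logistic_regression(X, y)

A fixed eta is the simplest choice; because l(W) is concave, more careful step-size selection or a second-order method also works and converges to the same maximizer.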

SLIDE 16

That’s all M(C)LE. How about MAP?

  • One common approach is to define priors on W

– Normal distribution, zero mean, identity covariance

  • Helps avoid very large weights and overfitting
  • MAP estimate: W ← arg max_W ln [ P(W) Π_l P(Y^l | X^l, W) ]  (see the sketch below)
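Concretely, with a zero-mean Gaussian prior on W with covariance (1/λ) I (the identity covariance mentioned above corresponds to λ = 1), the MAP objective and update become (a standard result, not specific to these slides):

    W_MAP = arg max_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
          = arg max_W [ Σ_l ln P(Y^l | X^l, W) - (λ/2) Σ_i wi² ]   (up to an additive constant)

so the gradient ascent update picks up a weight-decay term:

    wi ← wi - η λ wi + η Σ_l Xi^l ( Y^l - P̂(Y^l = 1 | X^l, W) )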
SLIDE 17

MLE vs MAP

  • Maximum conditional likelihood estimate: W_MCLE = arg max_W Σ_l ln P(Y^l | X^l, W)
  • Maximum a posteriori estimate: W_MAP = arg max_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
SLIDE 18

Naïve Bayes vs. Logistic Regression

  • Generative and Discriminative classifiers
  • Asymptotic comparison (# training examples → infinity)
  • when model correct
  • when model incorrect
  • Non-asymptotic analysis
  • convergence rate of parameter estimates
  • convergence rate of expected error
  • Experimental results

[Ng & Jordan, 2002]

SLIDE 19

Naïve Bayes vs Logistic Regression

Consider Y and Xi boolean, X = <X1 ... Xn>.
Number of parameters (counted below):

  • NB: 2n +1
  • LR: n+1

Estimation method:

  • NB parameter estimates are uncoupled
  • LR parameter estimates are coupled
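Where the counts come from:

  • NB needs P(Y=1) (one parameter) plus P(Xi=1 | Y=0) and P(Xi=1 | Y=1) for each of the n features: 1 + 2n in total
  • LR needs one weight wi per feature plus the bias w0: n + 1 in total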
SLIDE 20

What is the difference asymptotically?

Notation: let ε(hA,m) denote the error of the hypothesis learned via algorithm A from m training examples

  • If the assumed naïve Bayes model is correct, then ε(hGen,∞) = ε(hDis,∞): both converge to the same asymptotic error
  • If the assumed model is incorrect, then ε(hDis,∞) ≤ ε(hGen,∞): logistic regression has the smaller (or equal) asymptotic error

Note the assumed discriminative model can be correct even when the generative model is incorrect, but not vice versa

SLIDE 21

Rate of convergence: logistic regression

Let hDis,m be logistic regression trained on m examples in n dimensions. Then with high probability:

    ε(hDis,m) ≤ ε(hDis,∞) + O( sqrt( (n/m) log(m/n) ) )

Implication: if we want ε(hDis,m) ≤ ε(hDis,∞) + ε0 for some constant ε0, it suffices to pick m on the order of n examples.

Logistic regression converges to its asymptotic classifier in order n examples (the result follows from Vapnik’s structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n).

SLIDE 22

Rate of convergence: naïve Bayes

Consider first how quickly parameter estimates converge toward their asymptotic values. Then we’ll ask how this influences rate of convergence toward asymptotic classification error.

SLIDE 23

Rate of convergence: naïve Bayes parameters
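A paraphrase of the Ng & Jordan argument (not the slide’s exact statement): each naïve Bayes parameter is a simple empirical frequency, so by Hoeffding’s inequality plus a union bound over the O(n) parameters,

    m = O( (1/ε²) log(n/δ) )

examples suffice for every parameter estimate to be within ε of its asymptotic value with probability at least 1 - δ (treating the class proportions as bounded away from 0 and 1). This is why naïve Bayes approaches its asymptotic classifier after order log n examples, versus order n for logistic regression.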

SLIDE 24

Some experiments from UCI data sets

SLIDE 25

What you should know:

  • Logistic regression

– Functional form follows from Naïve Bayes assumptions
– But training procedure picks parameters without the conditional independence assumption
– MLE training: pick W to maximize P(Y | X, W)
– MAP training: pick W to maximize P(W | X, Y)

  • ‘regularization’
  • Gradient ascent/descent

– General approach when closed-form solutions unavailable

  • Generative vs. Discriminative classifiers

– Bias vs. variance tradeoff