SLIDE 1 Logistic Regression
Machine Learning 10-701 Tom M. Mitchell Center for Automated Learning and Discovery Carnegie Mellon University September 29, 2005
Required reading:
- Mitchell draft chapter (see course website)
Recommended reading:
- Bishop, Chapter 3.1.3, 3.1.4
- Ng and Jordan paper (see course website)
SLIDE 2 Naïve Bayes: What you should know
- Designing classifiers based on Bayes rule
- Conditional independence
– What it is
– Why it’s important
- Naïve Bayes assumption and its consequences
– Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y))
- How to train Naïve Bayes classifiers
– MLE and MAP estimates
– With discrete and/or continuous inputs
SLIDE 3 Generative vs. Discriminative Classifiers
Wish to learn f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is the ‘generative’ model
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y | X = xi)
Discriminative classifiers:
- Assume some functional form for P(Y|X)
- This is the ‘discriminative’ model
- Estimate parameters of P(Y|X) directly from training data
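To make the contrast concrete, here is a small Python sketch (the toy data and all names are illustrative, not from the lecture): the generative route estimates P(Y) and P(X|Y) from counts and inverts them with Bayes rule, while the discriminative route would parameterize P(Y|X) directly, as the following slides develop.

```python
import numpy as np

# Toy data: 6 examples, 2 boolean features, boolean label (illustrative only).
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

# Generative (Naive Bayes) route: estimate P(Y) and P(Xi|Y) from counts,
# then apply Bayes rule to obtain P(Y=1 | X=x) for a query point x.
pi = y.mean()                       # estimate of P(Y=1)
theta1 = X[y == 1].mean(axis=0)     # estimates of P(Xi=1 | Y=1)
theta0 = X[y == 0].mean(axis=0)     # estimates of P(Xi=1 | Y=0)

x = np.array([1, 1])                # query point
lik1 = np.prod(theta1**x * (1 - theta1)**(1 - x))  # P(X=x|Y=1) under cond. indep.
lik0 = np.prod(theta0**x * (1 - theta0)**(1 - x))  # P(X=x|Y=0)
posterior = pi * lik1 / (pi * lik1 + (1 - pi) * lik0)  # Bayes rule
print(posterior)                    # P(Y=1 | X=x) = 0.8 on this toy data

# Discriminative (logistic regression) route: skip P(X|Y) entirely and posit
# P(Y=1|X) = 1 / (1 + exp(-(w0 + w.x))), estimating w directly (later slides).
```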
SLIDE 4
- Consider learning f: X → Y, where
- X is a vector of real-valued features, < X1 … Xn >
- Y is boolean
- We could use a Gaussian Naïve Bayes classifier
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(μik,σ)
- model P(Y) as Bernoulli (π)
- What does that imply about the form of P(Y|X)?
SLIDE 5
- Consider learning f: X → Y, where
- X is a vector of real-valued features, < X1 … Xn >
- Y is boolean
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(μik,σi)
- model P(Y) as Bernoulli (π)
- What does that imply about the form of P(Y|X)?
SLIDE 6
Very convenient!
P(Y=1 | X=<X1, ..., Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
implies P(Y=0 | X) = 1 / (1 + exp(w0 + Σi wi Xi))
implies P(Y=1 | X) / P(Y=0 | X) = exp(w0 + Σi wi Xi)
implies ln [ P(Y=1 | X) / P(Y=0 | X) ] = w0 + Σi wi Xi
→ linear classification rule: classify Y=1 exactly when w0 + Σi wi Xi > 0
SLIDE 7
Derive form for P(Y|X) for continuous Xi
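A sketch of the derivation, following the assumptions above (Bernoulli prior π, Gaussian class conditionals N(μik, σi) with variances independent of Y):

\begin{align*}
P(Y=1 \mid X) &= \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)} \\
&= \frac{1}{1 + \exp\!\left(-\Big(\ln\tfrac{\pi}{1-\pi} + \sum_i \ln\tfrac{P(X_i \mid Y=1)}{P(X_i \mid Y=0)}\Big)\right)}
\end{align*}

Substituting the Gaussians and expanding, the quadratic terms in $X_i$ cancel because $\sigma_i$ does not depend on $Y$:

\[
\sum_i \ln\frac{P(X_i \mid Y=1)}{P(X_i \mid Y=0)} = \sum_i \left( \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}\,X_i + \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2} \right)
\]

so $P(Y \mid X)$ takes exactly the logistic form:

\[
P(Y=1 \mid X) = \frac{1}{1+\exp\!\left(-(w_0 + \sum_i w_i X_i)\right)}, \qquad
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}, \qquad
w_0 = \ln\frac{\pi}{1-\pi} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}.
\]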
SLIDE 8
Very convenient!
P(Y=1 | X=<X1, ..., Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
implies P(Y=0 | X) = 1 / (1 + exp(w0 + Σi wi Xi))
implies P(Y=1 | X) / P(Y=0 | X) = exp(w0 + Σi wi Xi)
implies ln [ P(Y=1 | X) / P(Y=0 | X) ] = w0 + Σi wi Xi
→ linear classification rule: classify Y=1 exactly when w0 + Σi wi Xi > 0
SLIDE 9
Logistic function: σ(z) = 1 / (1 + exp(−z)), which maps any real z into (0, 1), with σ(0) = 0.5
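A small Python sketch of the logistic function (the weight values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

# P(Y=1|X) = sigmoid(w0 + w.x), so predicting Y=1 when the probability
# exceeds 0.5 is exactly the linear rule w0 + w.x > 0 from the last slide.
w0, w = -1.0, np.array([2.0, -0.5])   # hypothetical weights
x = np.array([1.0, 1.0])
p1 = sigmoid(w0 + w @ x)              # sigmoid(0.5), about 0.62
print(p1, p1 > 0.5)                   # predict Y=1, since w0 + w.x = 0.5 > 0
```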
SLIDE 10 Logistic regression more generally
- Logistic regression in the more general case, where Y ∈ {Y1, ..., YR}: learn R−1 sets of weights
for k < R:
P(Y=yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σj<R exp(wj0 + Σi wji Xi))
for k = R:
P(Y=yR | X) = 1 / (1 + Σj<R exp(wj0 + Σi wji Xi))
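A sketch of this rule in Python (the weight values are arbitrary placeholders, and `multiclass_probs` is an illustrative name):

```python
import numpy as np

def multiclass_probs(W, b, x):
    """P(Y=y_k | X=x) for k = 1..R using R-1 weight sets.

    W: (R-1, n) weight matrix, b: (R-1,) intercepts. Class R is the
    reference class, whose unnormalized score is fixed at exp(0) = 1.
    """
    scores = np.exp(b + W @ x)                      # exp(w_k0 + sum_i w_ki x_i), k < R
    denom = 1.0 + scores.sum()                      # 1 + sum over the R-1 classes
    return np.append(scores / denom, 1.0 / denom)   # classes k < R, then k = R

W = np.array([[0.5, -1.0], [1.0, 0.2]])   # hypothetical weights, R = 3 classes
b = np.array([0.1, -0.3])
p = multiclass_probs(W, b, np.array([1.0, 2.0]))
print(p, p.sum())                          # a proper distribution: sums to 1
```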
SLIDE 11 Training Logistic Regression: MCLE
- Choose parameters W = <w0, ..., wn> to maximize conditional likelihood of training data
- Training data D = {<X^1, Y^1>, ..., <X^L, Y^L>}
- Data likelihood = Πl P(X^l, Y^l | W)
- Data conditional likelihood = Πl P(Y^l | X^l, W)
where the conditional likelihood treats the X^l as given and models only the Y^l, so we maximize
l(W) ≡ ln Πl P(Y^l | X^l, W) = Σl ln P(Y^l | X^l, W)
SLIDE 12
Expressing Conditional Log Likelihood
With P(Y=1 | X, W) = 1 / (1 + exp(−(w0 + Σi wi Xi))):
l(W) = Σl Y^l ln P(Y^l=1 | X^l, W) + (1−Y^l) ln P(Y^l=0 | X^l, W)
     = Σl Y^l ln [ P(Y^l=1 | X^l, W) / P(Y^l=0 | X^l, W) ] + ln P(Y^l=0 | X^l, W)
     = Σl Y^l (w0 + Σi wi Xi^l) − ln(1 + exp(w0 + Σi wi Xi^l))
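As a quick numeric check, a Python sketch (illustrative names and data) evaluating l(W) both from the definition and from the simplified last line; the two agree, since the algebra above only rewrites the definition:

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, y):
    """Conditional log likelihood l(W), computed two equivalent ways."""
    z = w0 + X @ w                          # w0 + sum_i w_i X_i^l for each example l
    p1 = 1.0 / (1.0 + np.exp(-z))           # P(Y^l=1 | X^l, W)
    by_definition = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
    simplified = np.sum(y * z - np.log(1.0 + np.exp(z)))
    return by_definition, simplified

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # toy data
y = np.array([1, 0, 1])
print(conditional_log_likelihood(-0.5, np.array([1.0, -0.5]), X, y))
```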
SLIDE 13
Maximizing Conditional Log Likelihood
Good news: l(W) is a concave function of W, so gradient ascent cannot get trapped in a local maximum.
Bad news: there is no closed-form solution maximizing l(W), so we must optimize iteratively.
SLIDE 14 Maximizing Conditional Log Likelihood: Gradient
Differentiating l(W) from the previous slide (with the convention X0^l = 1, so w0 is handled uniformly):
∂l(W)/∂wi = Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W))
SLIDE 15
Maximize Conditional Log Likelihood: Gradient Ascent
Gradient ascent algorithm: iterate until change < ε
For all i, repeat:
wi ← wi + η Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W))
where η > 0 is a small step size.
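A runnable Python sketch of this batch update (the function name, toy data, and default η and ε are my own choices; the update step is the rule above):

```python
import numpy as np

def train_logistic_mcle(X, y, eta=0.01, eps=1e-6, max_iters=10_000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X: (L, n) real-valued features; y: (L,) labels in {0, 1}.
    A constant feature X0 = 1 is appended so w[0] plays the role of w0.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P̂(Y^l=1 | X^l, W) for every l
        step = eta * Xb.T @ (y - p1)           # η Σl Xi^l (Y^l − P̂(Y^l=1|X^l,W))
        w += step
        if np.max(np.abs(step)) < eps:         # iterate until change < ε
            break
    return w

# Usage on toy data: two overlapping Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(train_logistic_mcle(X, y))
```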
SLIDE 16 That’s all M(C)LE. How about MAP?
- One common approach is to define priors on W
– Normal distribution, zero mean, identity covariance
- Helps avoid very large weights and overfitting
- MAP estimate: W ← argmax_W ln [ P(W) Πl P(Y^l | X^l, W) ]
With the zero-mean Gaussian prior, this adds a −λ wi penalty term to each component of the gradient:
wi ← wi + η Σl Xi^l (Y^l − P̂(Y^l=1 | X^l, W)) − η λ wi
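A minimal variant of the MCLE sketch above for the MAP case (`lam` stands in for λ; the slide does not fix its value):

```python
import numpy as np

def train_logistic_map(X, y, lam=1.0, eta=0.01, eps=1e-6, max_iters=10_000):
    """Gradient ascent on ln P(W) + l(W), with a zero-mean Gaussian prior on W.

    The prior contributes -lam/2 * ||w||^2 to the objective, i.e. an extra
    -lam * w_i term in each gradient component; lam = 0 recovers MCLE.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        step = eta * (Xb.T @ (y - p1) - lam * w)   # MCLE gradient plus prior term
        w += step
        if np.max(np.abs(step)) < eps:
            break
    return w
```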
SLIDE 17 MLE vs MAP
- Maximum conditional likelihood estimate: W_MCLE = argmax_W Πl P(Y^l | X^l, W)
- Maximum a posteriori estimate (with a prior P(W), e.g. zero-mean Gaussian): W_MAP = argmax_W P(W) Πl P(Y^l | X^l, W)
SLIDE 18 Naïve Bayes vs. Logistic Regression
- Generative and Discriminative classifiers
- Asymptotic comparison (# training examples → ∞)
- when model correct
- when model incorrect
- Non-asymptotic analysis
- convergence rate of parameter estimates
- convergence rate of expected error
- Experimental results
[Ng & Jordan, 2002]
SLIDE 19 Naïve Bayes vs Logistic Regression
Consider Y and Xi boolean, X = <X1, ..., Xn>
Number of parameters:
- NB: 2n + 1 (π, plus P(Xi=1 | Y=0) and P(Xi=1 | Y=1) for each of the n features)
- LR: n + 1 (w0, w1, ..., wn)
Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
SLIDE 20 What is the difference asymptotically?
Notation: let ε(A, m) denote the error of the hypothesis learned via algorithm A from m examples
- If the assumed naïve Bayes model is correct, then the two converge to classifiers of equal accuracy: ε(NB, ∞) = ε(LR, ∞)
- If the assumed model is incorrect, logistic regression is asymptotically at least as accurate: ε(LR, ∞) ≤ ε(NB, ∞)
Note the assumed discriminative model can be correct even when the generative model is incorrect, but not vice versa (the naïve Bayes assumptions imply the logistic form of P(Y|X), not conversely)
SLIDE 21 Rate of convergence: logistic regression
Let h_Dis,m be logistic regression trained on m examples in n dimensions. Then with high probability:
ε(h_Dis,m) ≤ ε(h_Dis,∞) + O( sqrt( (n/m) log(m/n) ) )
Implication: if we want ε(h_Dis,m) ≤ ε(h_Dis,∞) + ε0 for some constant ε0, it suffices to pick m of order n examples
→ logistic regression converges to its asymptotic classifier with order n examples (the result follows from Vapnik’s structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n + 1)
SLIDE 22
Rate of convergence: naïve Bayes
Consider first how quickly parameter estimates converge toward their asymptotic values. Then we’ll ask how this influences rate of convergence toward asymptotic classification error.
SLIDE 23
Rate of convergence: naïve Bayes parameters
Ng & Jordan (2002) show that the naïve Bayes parameter estimates come within ε of their asymptotic values with only m = O(log n) examples, so naïve Bayes reaches its (possibly less accurate) asymptotic classifier after order log n examples, far fewer than logistic regression’s order n.
SLIDE 24
Some experiments from UCI data sets
SLIDE 25 What you should know:
- Logistic regression:
– Functional form follows from Naïve Bayes assumptions
– But training procedure picks parameters without the conditional independence assumption
– MLE training: pick W to maximize P(Y | X, W)
– MAP training: pick W to maximize P(W | X, Y); this is a form of ‘regularization’
- Gradient ascent/descent
– General approach when closed-form solutions unavailable
- Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff