1
Linear Classifiers
R Greiner Cmput 466/551
HTF: Ch4 B: Ch4
2
Outline
Framework
Exact:
  Minimize Mistakes (Perceptron Training)
  Matrix inversion (LMS)
Logistic Regression:
  Max Likelihood Estimation (MLE) of P( y | x )
  Gradient descent (MSE; MLE)
  Newton-Raphson
Linear Discriminant Analysis:
  Max Likelihood Estimation (MLE) of P( y, x )
  Direct Computation
  Fisher’s Linear Discriminant
3
4
Classifier: partitions input space X into decision regions, one per class
Linear threshold unit has a linear decision boundary (a hyperplane)
Defn: A set of points that can be separated by a hyperplane is “linearly separable”
[Scatter plot of instances: #wings vs. #antennae]
5
Draw a “separating line”: if #antennae ≤ 2, then butterfly-itis
6
If 2.3 × #Wings – 7.5 × #antennae + 1.2 > 0, predict butterfly-itis
Separating line: 2.3 × #w – 7.5 × #a + 1.2 = 0
7
Given data (many features):

Temp. | Press | Color | … | diseaseX?
35    | 95    | Pale  | … | No
22    | 80    | Clear | … | Yes
10    | 50    | Pale  | … | No
:     | :     | :     |   | :

Encoded numerically:

F1 | F2 | … | Fn  | Class
35 | 95 | … | 3   | No
22 | 80 | … | …   | Yes
10 | 50 | … | 1.9 | No
:  | :  |   | :   | :

Find “weights” {w1, w2, …, wn, w0} such that
  w1 × F1 + w2 × F2 + … + wn × Fn + w0 > 0   iff   Class = Yes
8
9
Given {wi}, and values for instance, compute response
Learning
Given labeled data, find “correct” {wi}
Linear Threshold Unit … “Perceptron”
10
Consider 3 training examples:
  ( [1.0, 1.0]; 1 )
  ( [0.5, 3.0]; 1 )
  ( [2.0, 2.0]; 0 )
Want a classifier that looks like. . .
11
The equation w·x = Σi wi·xi = 0 defines a plane
12
Squashing function: sgn (“Heaviside”)
  sgn: ℜ → {0, 1}
  sgn(r) = 1 if r > 0
           0 otherwise
Actually want w · x > b, but. . .
  Create an extra input x0 fixed at 1
  The corresponding weight w0 corresponds to –b
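A minimal sketch (Python/NumPy, not from the slides) of this x0 = 1 trick: the threshold b is folded into w0, and the Heaviside squashing turns w · x into a 0/1 prediction. The function name ltu_predict is illustrative; the sample weights reuse the butterfly-itis rule above.

```python
import numpy as np

def ltu_predict(w, x):
    """Linear threshold unit: prepend x0 = 1 so that w[0] plays the role of -b."""
    x_aug = np.concatenate(([1.0], x))       # extra input fixed at 1
    return 1 if np.dot(w, x_aug) > 0 else 0  # Heaviside squashing

# Weights mirroring the rule: 2.3 * #Wings - 7.5 * #antennae + 1.2 > 0
w = np.array([1.2, 2.3, -7.5])               # [w0, w_wings, w_antennae]
print(ltu_predict(w, np.array([4, 1])))      # 1.2 + 9.2 - 7.5 > 0  ->  1
```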
13
Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then ∃ a learning algorithm guaranteed to quickly converge to f!
enormous popularity, early / mid 60's
But some simple functions cannot be represented
… killed the field temporarily!
Can represent (only) linearly-separable decision surfaces
14
Hypothesis space is. . .
  Fixed Size: ∃ 2^O(n²) distinct perceptrons over n boolean features
  Deterministic
  Continuous Parameters
Learning algorithm:
  Various: Local search, Direct computation, . . .
  Eager
  Online / Batch
15
Input: labeled data
Output: w ∈ ℜ^(r+1)
. . . minimize mistakes wrt data . . .
16
Given data { [x(i), y(i)] }i=1..m, optimize one of. . .

#Mistakes:  err(w) = (1/m) Σi I[ Classw(x(i)) ≠ y(i) ]
  → Perceptron Training; Matrix Inversion
MSE:  errMSE(w) = (1/m) Σi ½ [ y(i) – ow(x(i)) ]²
  → Matrix Inversion; Gradient Descent
Log conditional likelihood:  LCL(w) = (1/m) Σi log Pw( y(i) | x(i) )
  → MSE Gradient Descent; LCL Gradient Descent
Log (joint) likelihood:  LL(w) = (1/m) Σi log Pw( y(i), x(i) )
  → Direct Computation
17
For each labeled instance [x, y]:
Idea: Move the weights in the appropriate direction
If Err > 0 (error on a POSITIVE example):
  need to increase sgn(w · x), ie, need to increase w · x
  Input j contributes wj · xj to w · x:
    if xj > 0, increasing wj will increase w · x
    if xj < 0, decreasing wj will increase w · x
  ⇒ wj ← wj + xj
If Err < 0 (error on a NEGATIVE example):
  ⇒ wj ← wj – xj
18
19
[Worked trace of the rule on three instances (#1, #2, #3) — columns: Weights | Instance | Action. Starting from w = [0 0 0], each mistake adds (+x) or subtracts (–x) the instance; the weights pass through values such as [1 0 0], [0 -1 1], [1 0 1], [0 -1 2], [1 0 2] and end at [1 -1 2], where all three instances are marked OK.]
Initialize w = 0
Do until bored:
  Predict “+” iff w · x > 0, else “–”
  Mistake on y = +1:  w ← w + x
  Mistake on y = –1:  w ← w – x
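A hedged sketch of this training loop in Python/NumPy; perceptron_train and max_epochs are illustrative names, and “until bored” is replaced by “until no mistakes, or an epoch limit”.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron training rule: start at w = 0; on a mistake for a positive
    example add x to w, on a mistake for a negative example subtract x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi > 0 else 0
            if pred != yi:                          # mistake
                w += xi if yi == 1 else -xi
                mistakes += 1
        if mistakes == 0:                           # consistent with all data
            break
    return w

# The three training examples from the earlier slide (labels 1, 1, 0)
X = np.array([[1.0, 1.0], [0.5, 3.0], [2.0, 2.0]])
y = np.array([1, 1, 0])
print(perceptron_train(X, y))
```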
20
∆ measures the “wiggle room” available:
  If |x| = 1, then ∆ is the maximum, over all consistent planes, of the minimum distance from a training point to that plane
w is ⊥ to the separator, as w · x = 0 at the boundary
So |w · x| is the projection of x onto the plane PERPENDICULAR to the boundary line
… ie, it is the distance from x to that line (once normalized)
(See SVMs…)
21
Let w* be the unit vector representing the target plane
  ∆ = minx { w* · x }
Let w be the hypothesis plane
Consider: on each mistake, add x to w
  (x is wrong wrt w iff w · x < 0)
22
If w makes a mistake on x…
  (recall ∆ = minx { w* · x })
23
Err( [x, y] ) = y – ow(x) ∈ { -1, 0, +1 }
If Err( [x, y] ) = 0 Correct! … Do nothing!
∆w = 0 ≡ Err( [x, y] ) · x
If Err( [x, y] ) = +1
Mistake on positive! Increment by +x ∆w = +x ≡ Err( [x, y] ) · x
If Err( [x, y] ) = -1
Mistake on negative! Increment by -x ∆w = -x ≡ Err( [x, y] ) · x
In all cases... ∆w(i) = Err( [x(i), y(i)] ) · x(i) = [ y(i) – ow(x(i)) ] · x(i)
∆wj = Σi ∆wj(i)
24
[Diagram: batch accumulation — set ∆w := 0; for each example i and each feature j, ∆wj += E(i) · x(i)j, where E(i) = y(i) – ow(x(i)).]
25
Rule is intuitive: climbs in the correct direction. . .
Theorem: Converges to the correct answer, if. . .
  the training data is linearly separable
  η is sufficiently small
Proof idea: the weight space has EXACTLY 1 minimum (no non-global minima),
  so with enough examples, it finds the correct function!
Explains early popularity
If η too large, may overshoot; if η too small, takes too long
So often η = η(k) … which decays with the number of iterations, k
26
Task: Given { xi, yi }i
yi ∈ { –1, +1 } is label
Linear equalities:  y = X w
Solution:  w = X⁻¹ y
27
If y(i) comes from a discrete set { 0, 1, …, m }: general (non-binary) classification
If y(i) ∈ ℜ is arbitrary: regression
… but X is typically singular / the system is overconstrained, so no exact solution exists.
Could instead try to minimize a residual:
  # mistakes:        Σi I[ y(i) ≠ w · x(i) ]
  || y – X w ||1  =  Σi | y(i) – w · x(i) |
  || y – X w ||2² =  Σi ( y(i) – w · x(i) )²
28
The “0/1 loss function” is not smooth, not differentiable…
The MSE error is smooth and differentiable…
29
Why not Gradient Descent?
  It only needs a gradient (derivative) of the error — which the smooth MSE provides.
  Gradient Descent is a general approach.
30
GOOD NEWS:
If data is linearly separated, Then FAST ALGORITHM finds correct {wi} !
31
GOOD NEWS:
If data is linearly separated, Then FAST ALGORITHM finds correct {wi} !
Some “data sets” are NOT linearly separable!
32
View as Regression
Find the “best” linear mapping w from X to Y:
  w* = argminw ErrLMS(X, Y)(w),   where ErrLMS(X, Y)(w) = Σi ( y(i) – w · x(i) )²
Threshold: if wTx > 0.5, predict class 1; else class 0
See Chapter 3…
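A small sketch of this regression view, assuming labels coded 0/1: since X is generally not square or invertible, the minimizer of ||y – Xw||² is computed with a least-squares solver (pseudo-inverse) rather than X⁻¹. The helper names lms_fit / lms_classify are illustrative.

```python
import numpy as np

def lms_fit(X, y):
    """LMS / least-squares fit: minimize ||y - Xw||^2 (with a bias column)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias column x0 = 1
    w, *_ = np.linalg.lstsq(X, y, rcond=None)       # solves min ||y - Xw||^2
    return w

def lms_classify(w, x):
    """Threshold the regression output at 0.5 (labels coded 0/1)."""
    return 1 if w @ np.concatenate(([1.0], x)) > 0.5 else 0

X = np.array([[1.0, 1.0], [0.5, 3.0], [2.0, 2.0]])
y = np.array([1, 1, 0])
w = lms_fit(X, y)
print(w, lms_classify(w, np.array([1.0, 2.0])))
```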
33
Use a discriminant function δk(x) for each class k
  Eg, δk(x) = P( G = k | X = x )
Classification rule: predict the class k with the largest δk(x)
If each δk(x) is linear in x, the decision boundaries are linear
34
2D input space: X = (X1, X2)
K = 3 classes, coded by an indicator response Y = (Y1, Y2, Y3):
  class 1 ↦ [1, 0, 0],  class 2 ↦ [0, 1, 0],  class 3 ↦ [0, 0, 1]
Training sample (N = 5):
  X = [ 1  x11  x12 ]        Y = [ y11  y12  y13 ]
      [ 1  x21  x22 ]            [ y21  y22  y23 ]
      [ 1  x31  x32 ]            [ y31  y32  y33 ]
      [ 1  x41  x42 ]            [ y41  y42  y43 ]
      [ 1  x51  x52 ]            [ y51  y52  y53 ]
Regression output:
  B̂ = (XTX)⁻¹ XT Y = [ β1  β2  β3 ]
  Ŷ((x1, x2)) = (1, x1, x2) B̂ = [ (1, x1, x2)β1,  (1, x1, x2)β2,  (1, x1, x2)β3 ]
  ie,  Ŷ1((x1, x2)) = (1, x1, x2)β1,   Ŷ2((x1, x2)) = (1, x1, x2)β2,   Ŷ3((x1, x2)) = (1, x1, x2)β3
Classification rule:
  Ĝ((x1, x2)) = argmaxk Ŷk((x1, x2))
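A sketch of this indicator-matrix regression; the toy data (N = 5 points, K = 3 classes) is made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Linear regression of the indicator matrix:
    Y[i, k] = 1 iff example i is in class k; B = (X^T X)^{-1} X^T Y."""
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])            # prepend the constant 1
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                        # indicator responses
    B = np.linalg.pinv(X1.T @ X1) @ X1.T @ Y        # one beta_k per class
    return B

def indicator_regression_classify(B, x):
    yhat = np.concatenate(([1.0], x)) @ B           # \hat{Y}_k(x) for each k
    return int(np.argmax(yhat))                     # G(x) = argmax_k \hat{Y}_k(x)

# Made-up data: N = 5 points in 2D, K = 3 classes
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
g = np.array([0, 0, 1, 2, 2])
B = indicator_regression_fit(X, g, K=3)
print(indicator_regression_classify(B, np.array([2.0, 2.5])))
```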
35
[Figures: one case with great separation, one with bad separation]
36
Want to compute Pw( y = 1 | x )
But…
  w·x has range (-∞, ∞)
  a probability must be in the range [0, 1]
Need a “squashing” function (-∞, ∞) → [0, 1]
37
38
39
Assume 2 classes:
  Pw( y = 1 | x ) = 1 / (1 + exp(–w·x))
  Pw( y = 0 | x ) = exp(–w·x) / (1 + exp(–w·x))
Log Odds:
  ln [ Pw( y = 1 | x ) / Pw( y = 0 | x ) ] = w·x
  … Linear in x
40
… depends on goal?
A: Minimize MSE?
B: Maximize likelihood?
41
Input: x(j) = [x(j)1, …, x(j)k]
Computed output: o(j) = σ( Σi x(j)i · wi ) = σ( z(j) ),
  where z(j) = Σi x(j)i · wi using the current { wi }
Correct output: y(j)
  σ(z) = 1 / (1 + e^(–z))
42
Gradient descent on the MSE: by the chain rule through the sigmoid,
  ∂E/∂wi = Σj ( o(j) – y(j) ) · o(j) ( 1 – o(j) ) · x(j)i ,   so   ∆wi = –η ∂E/∂wi
43
Update wi += ∆wi
  σ(z) = 1 / (1 + e^(–z))
Note: as o(j) = σ( z(j) ) is already computed to get the answer, it is trivial to compute σ’( z(j) ) = σ( z(j) ) ( 1 – σ( z(j) ) )
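A tiny sketch of the squashing function and the “free” derivative noted above (Python/NumPy; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z)),
    essentially free once the output o = sigma(z) has been computed."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_prime(z))
```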
44
[Diagram: the same accumulation scheme — ∆w = 0; for each example i and feature j, ∆wj += E(i) · x(i)j — where here E(i) = ( o(i) – y(i) ) o(i) ( 1 – o(i) ).]
45
As we are fitting a probability distribution, seek
  w* = argmaxw P( w | S )
     = argmaxw P( S | w ) P( w ) / P( S )    [Bayes rule]
     = argmaxw P( S | w ) P( w )             [as P(S) does not depend on w]
     = argmaxw P( S | w )                    [as P(w) is uniform]
     = argmaxw log P( S | w )                [as log is monotonic]
46
P( S | w ) ≡ likelihood function;  L(w) ≡ log P( S | w )
w* = argmaxw L(w)
47
As the training examples [x(i), y(i)] are iid
  — drawn independently from the same (unknown) distribution Pw(x, y) —
log P( S | w ) = log Πi Pw( x(i), y(i) ) = Σi log Pw( x(i), y(i) )
Here Pw( x(i) ) = 1/n does not depend on w, so it suffices to maximize Σi log Pw( y(i) | x(i) ) …
48
Want w* = argmaxw J(w),   where J(w) = Σi r( y(i), x(i), w )
For y ∈ {0, 1}:
  r( y, x, w ) = y log p1(x) + (1 – y) log( 1 – p1(x) ),   with p1(x) = Pw( y = 1 | x )
So climb along the gradient…
  ∂J(w)/∂wj = Σi ∂ r( y(i), x(i), w ) / ∂wj
49
∂r/∂wj = ∂/∂wj [ y log p1 + (1 – y) log(1 – p1) ]
       = y (1/p1) ∂p1/∂wj – (1 – y) (1/(1 – p1)) ∂p1/∂wj
       = ( y – p1 ) / [ p1 (1 – p1) ] · ∂p1/∂wj

∂p1/∂wj = ∂ Pw( y = 1 | x ) / ∂wj = ∂ σ( w·x ) / ∂wj
        = σ( w·x ) [ 1 – σ( w·x ) ] · ∂( w·x )/∂wj
        = p1 ( 1 – p1 ) xj

∂J(w)/∂wj = Σi ∂ r( y(i), x(i), w ) / ∂wj
          = Σi ( y(i) – p1(x(i)) ) / [ p1(x(i)) (1 – p1(x(i))) ] · p1(x(i)) (1 – p1(x(i))) · x(i)j
          = Σi ( y(i) – Pw( y = 1 | x(i) ) ) · x(i)j
50
[Diagram: the resulting update — ∆wj = Σi ( y(i) – p1(x(i)) ) x(i)j, then w ← w + η ∆w.]
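A hedged sketch of batch gradient ascent on the LCL, using the gradient Σi ( y(i) – Pw(y=1|x(i)) ) x(i)j just derived. The step size eta and epoch count are arbitrary illustrative choices, and the three training examples are reused from the earlier slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, eta=0.1, epochs=1000):
    """Batch gradient ascent on the log conditional likelihood:
    dJ/dw_j = sum_i (y_i - P_w(y=1|x_i)) * x_ij."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias term x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p1 = sigmoid(X @ w)                         # P_w(y=1 | x_i) for each i
        w += eta * X.T @ (y - p1)                   # climb the LCL gradient
    return w

X = np.array([[1.0, 1.0], [0.5, 3.0], [2.0, 2.0]])
y = np.array([1, 1, 0])
print(logistic_fit(X, y))
```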
51
This is the BATCH version; an online version updates after each example.
Can use second-order (Newton-Raphson) updates,
  which become an iteratively re-weighted least squares computation
52
Return YES iff
  P( y = 1 | x ) / P( y = 0 | x ) > 1
  iff  ln [ P( y = 1 | x ) / P( y = 0 | x ) ] > 0
  iff  ln [ (1/(1 + exp(–w·x))) / (exp(–w·x)/(1 + exp(–w·x))) ] = ln [ 1 / exp(–w·x) ] = w·x > 0
⇒ Logistic Regression learns an LTU!
53
Note: k–1 different weight vectors wi, … each of dimension |x|
54
MSE gradient:  E(i)j = ( o(i) – y(i) ) · o(i) ( 1 – o(i) )
LCL gradient:  E(i)j = ( y(i) – p(1 | x(i)) ) · x(i)j
where
  P( y | x, w ) = 1 / (1 + exp(–w·x))            if y = 1
                = exp(–w·x) / (1 + exp(–w·x))    otherwise
55
[Diagram: the same accumulation scheme — ∆w = 0; ∆wj += E(i) · x(i)j — with E(i) = ( o(i) – y(i) ) o(i) ( 1 – o(i) ) for the MSE version versus E(i) = ( y(i) – p(1 | x(i)) ) for the LCL version.]
56
Log likelihood (labels yi ∈ {0, 1}, p(x; β) = Pr( y = 1 | x; β )):
  l(β) = Σi=1..N { yi log Pr( y = 1 | xi; β ) + (1 – yi) log Pr( y = 0 | xi; β ) }
       = Σi=1..N { yi log [ 1/(1 + exp(–βTxi)) ] + (1 – yi) log [ exp(–βTxi)/(1 + exp(–βTxi)) ] }
       = Σi=1..N { yi βTxi – log( 1 + exp(βTxi) ) }
Setting ∂l(β)/∂β = Σi xi ( yi – p(xi; β) ) = 0 gives (p+1) non-linear equations.
Solve by the Newton-Raphson method:
  βnew = βold – [ ∂²l(β)/∂β∂βT ]⁻¹ ∂l(β)/∂β
57
A general technique for solving f(x) = 0
  … even if f is non-linear
Taylor series:  f( xn+1 ) ≈ f( xn ) + ( xn+1 – xn ) f’( xn )
  so  xn+1 ≈ xn + [ f( xn+1 ) – f( xn ) ] / f’( xn )
When xn+1 is near the root, f( xn+1 ) ≈ 0, giving the iteration
  xn+1 = xn – f( xn ) / f’( xn )
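A minimal sketch of this 1-D iteration; the √2 example is illustrative, not from the slides.

```python
def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson for f(x) = 0: iterate x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:       # converged
            break
    return x

# Example: root of x^2 - 2 = 0 (i.e. sqrt(2))
print(newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))
```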
58
To solve the system of equations
  f1( x1, …, xN ) = 0,  f2( x1, …, xN ) = 0,  …,  fN( x1, …, xN ) = 0
Taylor series:
  fj( x + ∆x ) ≈ fj( x ) + Σk ( ∂fj/∂xk ) ∆xk ,   j = 1, …, N
N-R: collect the partial derivatives into the Jacobian J = [ ∂fj/∂xk ], then iterate
  [ x1, …, xN ]n+1 = [ x1, …, xN ]n – J⁻¹ [ f1( xn ), …, fN( xn ) ]
59
[Worked example: solving a pair of non-linear equations in (x1, x2) — involving sin, cos, and quadratic terms — by iterating the 2×2 Jacobian update above.]
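Since the slide’s exact pair of equations is not recoverable, here is a hedged sketch of the multivariate Newton-Raphson step on a hypothetical system of the same flavour (quadratic plus trigonometric terms); F, J, and the starting point are assumptions for illustration.

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-10, max_iter=50):
    """Multivariate Newton-Raphson: x_{n+1} = x_n - J(x_n)^{-1} F(x_n)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        delta = np.linalg.solve(J(x), F(x))         # solve J * delta = F
        x -= delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# Hypothetical system (the slide's exact example is not recoverable):
#   f1 = x1^2 + x2^2 - 3 = 0,   f2 = sin(x1) - x2 = 0
F = lambda x: np.array([x[0]**2 + x[1]**2 - 3, np.sin(x[0]) - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [np.cos(x[0]), -1.0]])
print(newton_system(F, J, x0=[1.0, 1.0]))
```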
60
Find the unknown parameters µ, σ²
Estimate the parameters that maximize the likelihood of the data:
  L( µ, σ ) = Πi=1..N ( 1/√(2πσ²) ) exp( –( xi – µ )² / (2σ²) )
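A sketch of the resulting maximum-likelihood estimates for a 1-D Gaussian: maximizing L(µ, σ) gives the sample mean and the 1/N (not 1/(N–1)) sample variance. The data values are made up.

```python
import numpy as np

def gaussian_mle(x):
    """MLE for a 1-D Gaussian: sample mean and the biased (1/N) sample variance."""
    mu_hat = np.mean(x)
    var_hat = np.mean((x - mu_hat) ** 2)            # divide by N, not N-1
    return mu_hat, var_hat

x = np.array([4.9, 5.3, 5.0, 4.7, 5.1])
print(gaussian_mle(x))
```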
61
Learns Conditional Probability Distribution P( y | x ) Local Search:
Begin with initial weight vector; iteratively modify to maximize objective function log likelihood of the data (ie, seek w s.t. probability distribution Pw( y | x ) is most likely given data.)
Eager: Classifier constructed from training examples,
which can then be discarded.
Online or batch
62
Linear regression of the indicator matrix can lead to masking; LDA can avoid this masking.
[Figure: 2D input space and three classes, with the fitted indicator regressions
  Ŷ1(x) = (1, x1, x2)β1,  Ŷ2(x) = (1, x1, x2)β2,  Ŷ3(x) = (1, x1, x2)β3
shown along the viewing direction — the middle class is “masked”.]
63
LDA learns joint distribution P( y, x )
As P( y, x ) determines P( y | x ), this is a
“generative model”:
P( y, x ) models how the data is generated. Eg, factor
P( y, x ) = P( y ) P( x | y )
  P( y ) generates a value for y; then P( x | y ) generates a value for x given this y
Belief net:  Y → X
64
P( y, x ) = P( y ) P( x | y ) P( y ) is a simple discrete distribution
Eg: P( y = 0 ) = 0.31; P( y = 1 ) = 0.69
(31% negative examples; 69% positive examples) Assume P( x | y ) is multivariate normal,
65
Linear discriminant analysis assumes the form
  P( x | y ) = N( µy, Σ ) = 1/( (2π)^(n/2) |Σ|^(1/2) ) exp( –½ (x – µy)T Σ⁻¹ (x – µy) )
µy is the mean for examples belonging to class y;
the covariance matrix Σ is shared by all classes!
Can estimate the LDA parameters directly:
  mk = # training examples in class y = k
  Estimate of P( y = k ):  pk = mk / m
  µ̂k = (1/mk) Σ{i : yi = k} x(i)
  Σ̂ = 1/(m – K) Σi ( x(i) – µ̂yi )( x(i) – µ̂yi )T
  (Subtract the corresponding class mean µ̂yi from each x(i) before taking the outer product)
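A hedged sketch of these estimates in Python/NumPy; lda_fit is an illustrative name, and the data vectors are made-up stand-ins rather than the slides’ example.

```python
import numpy as np

def lda_fit(X, y):
    """LDA estimates: p_k = m_k/m, class means mu_k, and one shared covariance
    from the pooled, mean-subtracted outer products (divide by m - K)."""
    classes = np.unique(y)
    m, n = X.shape
    K = len(classes)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    Sigma = np.zeros((n, n))
    for k in classes:
        D = X[y == k] - means[k]                    # subtract the class mean
        Sigma += D.T @ D                            # sum of outer products
    Sigma /= (m - K)
    return priors, means, Sigma

# Made-up data: m = 7 examples with n = 3 features, 2 classes
X = np.array([[5.0, 14.0, 6.0], [4.0, 12.0, 5.0], [6.0, 15.0, 7.0],
              [1.0, 3.0, 2.0], [2.0, 4.0, 1.0], [1.5, 3.5, 2.5], [2.5, 5.0, 2.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0])
print(lda_fit(X, y))
```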
66
m = 7 examples  (Note: do NOT pre-pend x0 = 1 here!)
[The 7 training vectors and their class labels are listed on the slide.]
67
[Worked computation of the estimates p̂k, µ̂k, and Σ̂ from z(1), …, z(7).]
68
How to classify a new instance, given these estimates?
Class for instance x = [5, 14, 6]T ?
69
Consider the 2-class case with a 0/1 loss function. Classify ŷ = 1 iff
  P( y = 1 | x ) > P( y = 0 | x )
iff  log [ π1 P( x | y = 1 ) ] > log [ π0 P( x | y = 0 ) ]
70
Expanding the Gaussians, the quadratic terms are
  (x – µ1)T Σ⁻¹ (x – µ1) – (x – µ0)T Σ⁻¹ (x – µ0)
As Σ⁻¹ is symmetric, the xT Σ⁻¹ x terms cancel, leaving a function that is linear in x.
71
So let
  w = Σ⁻¹ ( µ1 – µ0 ),   b = –½ ( µ1T Σ⁻¹ µ1 – µ0T Σ⁻¹ µ0 ) + log( π1 / π0 )
Classify ŷ = 1 iff  w · x + b > 0
72
LDA was able to avoid masking here
73
Squared Mahalanobis distance between x and µ:
  DM²( x, µ ) = ( x – µ )T Σ⁻¹ ( x – µ )
Σ⁻¹ converts standard Euclidean distance into Mahalanobis distance.
LDA classifies x as class 0 iff
  log π0 – ½ DM²( x, µ0 )  >  log π1 – ½ DM²( x, µ1 )
since log [ πk P( x | y = k ) ] = log πk – ½ DM²( x, µk ) + const
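A sketch of this rule: compute log πk – ½ DM²(x, µk) for each class and pick the largest. The priors reuse the 0.31 / 0.69 example from earlier; the means and covariance are made up, and the function names are illustrative.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma_inv):
    """Squared Mahalanobis distance D_M^2(x, mu) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ Sigma_inv @ d

def lda_classify(x, priors, means, Sigma):
    """Pick the class k maximizing log pi_k - 0.5 * D_M^2(x, mu_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = {k: np.log(priors[k]) - 0.5 * mahalanobis_sq(x, means[k], Sigma_inv)
              for k in means}
    return max(scores, key=scores.get)

# Toy 2-class example (made-up means/covariance; priors from the earlier slide)
priors = {0: 0.31, 1: 0.69}
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(lda_classify(np.array([1.5, 1.0]), priors, means, Sigma))
```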
74
General Gaussian Classifier: QDA
  Allow each class k to have its own Σk
  Classifier ≡ quadratic threshold unit (not an LTU)
Naïve Gaussian Classifier
  Allow each class k to have its own Σk, but require each Σk to be diagonal
  ⇒ within each class, any pair of features xi and xj are independent
  Classifier is still a quadratic threshold unit, but with a restricted form
Most “discriminating” Low Dimensional Projection
Fisher’s Linear Discriminant
75
Better than linear regression at handling masking; usually computationally more expensive than LDA
76
Covariance matrix, with n features and k classes:

Name                     Same Σ for all classes?  Diagonal?  # covariance params
LDA                      yes                      no         n(n+1)/2
General Gaussian (QDA)   no                       no         k · n(n+1)/2
Naïve Gaussian           no                       yes        k · n
SuperSimple              yes                      yes        n
77
[Figure: decision boundaries produced by the LDA, Quadratic (QDA), Naïve, and SuperSimple classifiers on the same data.]
78
Learns the Joint Probability Distribution P( y, x )
Direct Computation:
  ML estimate of P( y, x ) computed directly from the data, without search.
  But need to invert a matrix, which is O(n³)
Eager:
  Classifier constructed from training examples, which can then be discarded.
Batch: Only a batch algorithm.
  An online LDA algorithm requires an online algorithm for incrementally updating Σ⁻¹
  [Easy if Σ⁻¹ is diagonal. . . ]
79
LDA
Finds a (K–1)-dimensional hyperplane
  (K = number of classes)
Project x and the { µk } onto that hyperplane; classify x as the nearest µk within the hyperplane
Better: choose the projection that best “discriminates” the classes
  ⇒ Fisher’s Linear Discriminant
80
Recall any vector w projects ℜn → ℜ
Goal: Want w that “separates” the classes:
  each projected positive w · x+ far from each projected negative w · x–
Perhaps project onto m+ – m– (the difference of the class means µ+, µ–)?  Still overlap… why?
81
Problem with m+ – m– : it does not consider the “scatter” within each class
Goal: Want w that “separates” the classes:
  Each w · x+ far from each w · x–
  Positive x+'s: the w · x+ close to each other
  Negative x–'s: the w · x– close to each other
“Scatter” of the projected +instances and –instances:
  s+² = Σ{x ∈ +} ( w · x – m+ )²
  s–² = Σ{x ∈ –} ( w · x – m– )²
82
83
Separate the means m– and m+   … maximize (m+ – m–)²   (“between-class scatter”)
Minimize each spread s+², s–²   … minimize (s+² + s–²)   (“within-class scatter”)
Objective function: maximize
  J(w) = ( m+ – m– )² / ( s+² + s–² )
84
s+² = Σ{x ∈ +} ( w · x – m+ )² = wT S+ w,  where S+ = Σ{x ∈ +} ( x – µ+ )( x – µ+ )T   … the “within-class scatter matrix” for +
similarly s–² = wT S– w, with S– the “within-class scatter matrix” for –
SW = S+ + S– , so  s+² + s–² = wT SW w
Also ( m+ – m– )² = wT SB w,  where SB = ( µ+ – µ– )( µ+ – µ– )T   … the “between-class scatter matrix”
⇒ J(w) = ( wT SB w ) / ( wT SW w )
85
Maximizing J(w) = ( wT SB w ) / ( wT SW w ) …
Lagrange:  L(w, λ) = wT SB w + λ ( 1 – wT SW w )
Setting the gradient to 0:  SB w = λ SW w
⇒ w* is an eigenvector of SW⁻¹ SB
86
Optimal w* is an eigenvector of SW⁻¹ SB
  (for two classes, w* ∝ SW⁻¹ ( µ+ – µ– ))
When P( x | yi ) ~ N( µi, Σ ), this is the LDA direction
Can use it even if the data are not Gaussian:
  J(w) = ( m+ – m– )² / ( s+² + s–² ) = ( wT SB w ) / ( wT SW w )
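A hedged sketch of the two-class Fisher direction, using the closed form w* ∝ SW⁻¹(µ+ – µ–) noted above; fisher_ld and the six data points are illustrative.

```python
import numpy as np

def fisher_ld(X_pos, X_neg):
    """Fisher's linear discriminant: w* maximizes (w^T S_B w)/(w^T S_W w);
    for two classes this is proportional to S_W^{-1} (mu+ - mu-)."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.zeros((X_pos.shape[1],) * 2)
    for Xc, mu in ((X_pos, mu_p), (X_neg, mu_n)):
        D = Xc - mu
        Sw += D.T @ D                               # within-class scatter
    w = np.linalg.solve(Sw, mu_p - mu_n)            # S_W^{-1} (mu+ - mu-)
    return w / np.linalg.norm(w)

# Made-up 2-D data for the two classes
X_pos = np.array([[1.0, 1.0], [0.5, 3.0], [1.5, 2.0]])
X_neg = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 0.5]])
print(fisher_ld(X_pos, X_neg))
```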
87
Fisher’s LD = LDA when…
  the prior probabilities are the same
  each class-conditional density is Gaussian
  … with a common covariance matrix Σ
Fisher’s LD…
  does not assume Gaussian densities
  can be used to reduce the dimension even when the densities are not Gaussian
88
Which is best: LMS, LR, LDA, FLD?  Ongoing debate within machine learning:
  direct classifiers             [ LMS ]
  conditional models P( y | x )  [ LR ]
  generative models P( y, x )    [ LDA, FLD ]
Stay tuned...
89
Statistical efficiency
If generative model P( y, x ) is correct, then … usually gives better accuracy, particularly if training sample is small
Computational efficiency
Generative models typically easiest to compute (LDA/FLD computed directly, without iteration)
Robustness to changing loss functions
LMS must re-train the classifier when the loss function changes. … no retraining for generative and conditional models
Robustness to model assumptions.
Generative model usually performs poorly when the assumptions are violated. Eg, LDA works poorly if P( x | y ) is non-Gaussian. Logistic Regression is more robust, … LMS is even more robust
Robustness to missing values and noise.
In many applications, some of the features xij may be missing or corrupted for some of the training examples. Generative models typically provide better ways of handling this than non-generative models.
90
Naive Bayes [Discuss later]
Winnow [?Discuss later?]
  (works well when many features are irrelevant, ie, features whose weights should be zero)
91
Assume data is truly linearly separable. . .
Sample Complexity: Given ε, δ ∈ (0, 1),
want an LTU whose error rate (on new examples)
is less than ε, with probability > 1 – δ.
Suffices to learn from (be consistent with) a number of labeled training examples that is polynomial in n, 1/ε, and 1/δ.
Computational Complexity:
There is a polynomial time algorithm for finding a consistent LTU
(via a reduction to linear programming)
Agnostic case… different…