CS 7616 Pattern Recognition: Linear, Linear, Linear - Aaron Bobick (PowerPoint PPT Presentation)


SLIDE 1

CS 7616 Pattern Recognition: Linear, Linear, Linear
Aaron Bobick, School of Interactive Computing

SLIDE 2

Administrivia

  • First problem set will be out tonight (Thurs 1/23). Due in more than one week: Sunday, Feb 2 (touchdown…), 11:55pm.
  • General description: for a trio of data sets (one common, one from the sets we provide, one from those sets or your own), use parametric density estimation for normal densities to find the best result. Use both MLE methods and Bayes.

  • But next one may be out before this one is due.
SLIDE 3

Today brought to you by…

  • Some materials borrowed from Jie Lu, Joy, Lucian @ CMU, Geoff Hinton (U Toronto), and Reza Shadmehr (Hopkins)

SLIDE 4

Outline for “today”

  • We have seen linear discriminants arise in the case of normal distributions. (When?)
  • Now we’ll approach from another way:
  • Linear regression – really least squares
  • “Hat” operator
  • From regression to classification: indicator matrix
  • Logistic regression – which is not regression but classification
  • Reduced-rank linear discriminants – Fisher Linear Discriminant Analysis
SLIDE 5

Jumping ahead…

  • Last time: regression and some discussion of discriminants from normal distributions.
  • This time: logistic regression and Fisher LDA.
SLIDE 6

First: regression

  • Let $X = (X_1, X_2, \ldots, X_p)$ be a random vector; $\mathbf{x}_i$ is the $i$th sample vector. Let $y_i$ be a real value associated with $\mathbf{x}_i$.
  • Let us assume we want to build a predictor of $y$ based upon a linear model: $\hat{y} = \beta_0 + \mathbf{x}^T\boldsymbol{\beta}$.
  • Choose $\boldsymbol{\beta}$ such that the residual is smallest:

$$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left(y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta}\right)^2$$
SLIDE 7

Linear regression

  • Easy to do with vector notation: let $\mathbf{X}$ be an $N \times (p+1)$ matrix where each row is $(1, \mathbf{x}_i^T)$ (why $p+1$?). Let $\mathbf{y}$ be an $N$-long column vector of outputs. Then:

$$\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

  • Want to minimize this. How? Differentiate:

$$\frac{\partial\,\mathrm{RSS}}{\partial\boldsymbol{\beta}} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
SLIDE 8

Continuing…

  • Setting the derivative to zero: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0$
  • Solving: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
  • Predicting: $\hat{y} = \mathbf{x}^T\hat{\boldsymbol{\beta}}$
  • Could now predict the original y’s: $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
  • The matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called H, for “hat”: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
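A minimal numpy sketch of this least-squares solution and the hat matrix (the function and variable names are ours, not the slides'):

```python
import numpy as np

def fit_least_squares(X, y):
    """Least-squares fit of y ~ (1, x)^T beta; returns beta_hat and the hat matrix."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend the 1s column (the p+1)
    beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # solves beta = (X^T X)^{-1} X^T y
    H = Xb @ np.linalg.inv(Xb.T @ Xb) @ Xb.T           # the "hat" matrix: y_hat = H y
    return beta_hat, H

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=50)
beta_hat, H = fit_least_squares(X, y)
assert np.allclose(H @ y, np.hstack([np.ones((50, 1)), X]) @ beta_hat)
```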

SLIDE 9

Two views of regression

SLIDE 10

Linear Methods for Classification

  • What are they? Methods that give linear decision boundaries between classes: $\{x : \beta_0 + \beta_1^T x = 0\}$
  • How to define decision boundaries? Two classes of methods:
  • Model discriminant functions δk(x) for each class as linear
  • Model the boundaries between classes as linear
SLIDE 11

Two Classes of Linear Methods

  • Model discriminant functions δk(x) for each class as linear; choose the k for which δk(x) is largest.
  • Different models/methods:
  • Linear regression fit to the class indicator variables
  • Linear discriminant analysis (LDA)
  • Logistic regression (LOGREG)
  • Model the boundaries between classes as linear (will be discussed later in class):
  • Perceptron
  • Support vector classifier (SVM)
SLIDE 12

Linear Regression Fit to the Class Indicator Variables

  • Linear model for the kth indicator response variable: $\hat{f}_k(x) = \hat{\beta}_{k0} + \hat{\beta}_k^T x$
  • Decision boundary is the set of points: $\{x : \hat{f}_k(x) = \hat{f}_l(x)\} = \{x : (\hat{\beta}_{k0} - \hat{\beta}_{l0}) + (\hat{\beta}_k - \hat{\beta}_l)^T x = 0\}$
  • Linear discriminant function for class k: $\delta_k(x) = \hat{f}_k(x)$

SLIDE 13

Linear Regression Fit to the Class Indicator Variables

  • Let Y be a vector where the kth element $Y_k$ is a 1 if the class of the corresponding input is k, zero otherwise. This vector Y is an indicator vector.
  • For a set of N training points we can stack the Y’s into an N×K matrix such that each row is the Y for a single input. In this case each column is a different indicator function to be learned. A different regression problem.
SLIDE 14

Linear Regression Fit to the Class Indicator Variables

  • Best linear fit: for a single column we know how to solve this: $\hat{\boldsymbol{\beta}}_k = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_k$, i.e. $\hat{y} = \mathbf{x}^T\hat{\boldsymbol{\beta}}_k$
  • So for the stacked $\mathbf{Y}$: $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$

SLIDE 15

Linear Regression Fit to the Class Indicator Variables

  • So given the columns of weights $\hat{\mathbf{B}}$ (just the stacked $\hat{\boldsymbol{\beta}}_k$ columns)
  • Compute the discriminant functions as a row vector: $\hat{f}(x)^T = (1, x^T)\,\hat{\mathbf{B}}$
  • And choose class k for whichever $\hat{f}_k(x)$ is largest
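A small numpy sketch of the full indicator-matrix classifier under the obvious assumptions (one-hot targets, one least-squares fit per column; all names are ours):

```python
import numpy as np

def fit_indicator_regression(X, labels, K):
    """Least-squares fit of one indicator column per class; returns (p+1) x K weights."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])         # N x (p+1), with constant column
    Y = np.zeros((N, K))
    Y[np.arange(N), labels] = 1.0                # N x K indicator (one-hot) matrix
    B, *_ = np.linalg.lstsq(Xb, Y, rcond=None)   # solves all K regressions at once
    return B

def indicator_predict(B, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ B, axis=1)             # choose the k with the largest f_k(x)
```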

SLIDE 16

Linear Regression Fit to the Class Indicator Variables

  • So why is this a good idea? Or is it?
  • This is actually a sum-of-squares approach: define the class indicator as a target value of 1 or 0. The goal is to fit each class target function as well as possible.

  • How well does it work?
  • Pretty well when K=2 (number of classes)
  • But…
SLIDE 17

Linear Regression Fit to the Class Indicator Variables

  • Problem:
    – When K ≥ 3, classes can be masked by others
    – Because of the rigid nature of the regression model

SLIDE 18

Linear Regression Fit to the Class Indicator Variables

Quadratic Polynomials

SLIDE 19

Linear Discriminant Analysis (Common Covariance Matrix Σ)

  • Model the class-conditional density of X in class k as multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$$

  • Class posterior:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$

  • Decision boundary is the set of points:

$$\{x : \Pr(G=k \mid X=x) = \Pr(G=l \mid X=x)\} = \left\{x : \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) = 0\right\}$$

SLIDE 20

Linear Discriminant Analysis (Common Σ), cont’d

  • Linear discriminant function for class k:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

  • Classify to the class with the largest value for its δk(x): $\hat{G}(x) = \arg\max_{k}\hat{\delta}_k(x)$
  • Parameter estimation: the objective function is the full log-likelihood,

$$\hat{\beta} = \arg\max_{\beta}\sum_{i=1}^{N}\log\Pr(x_i, y_i \mid \beta) = \arg\max_{\beta}\sum_{i=1}^{N}\log\left[\Pr(x_i \mid y_i; \beta)\,\Pr(y_i)\right]$$

  • Estimated parameters:

$$\hat{\pi}_k = \frac{N_k}{N}, \qquad \hat{\mu}_k = \frac{1}{N_k}\sum_{g_i = k} x_i, \qquad \hat{\Sigma} = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
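A compact numpy sketch of these estimates and the resulting discriminant rule (a sketch under the common-covariance assumption; the names are ours):

```python
import numpy as np

def fit_lda(X, labels, K):
    """Estimates pi_k, mu_k, and the pooled covariance for common-Sigma LDA."""
    N, p = X.shape
    pi = np.array([np.mean(labels == k) for k in range(K)])          # pi_k = N_k / N
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])   # class means
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[labels == k] - mu[k]
        Sigma += D.T @ D                                             # pooled scatter
    return pi, mu, Sigma / (N - K)

def lda_predict(X, pi, mu, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - 0.5 mu_k^T Sinv mu_k + log pi_k
    delta = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return np.argmax(delta, axis=1)
```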

SLIDE 21

More on being linear…

SLIDE 22

The planar decision surface in data-space for the simple linear discriminant function:

$$\mathbf{w}^T\mathbf{x} + w_0 \ge 0$$

SLIDE 23

Gaussian Linear Discriminant Analysis with Common Covariance Matrix (GDA)

  • Model the class-conditional density of X in class k as multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$$

  • Class posterior:

$$\Pr(C = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$

  • Decision boundary is the set of points:

$$\{x : \Pr(C=k \mid X=x) = \Pr(C=l \mid X=x)\} = \left\{x : \log\frac{\Pr(C=k \mid X=x)}{\Pr(C=l \mid X=x)} = 0\right\}$$

$$= \left\{x : \log\frac{\Pr(C_k)}{\Pr(C_l)} - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l) = 0\right\}$$

SLIDE 24

Gaussian Linear Discriminant Analysis with Common Covariance Matrix (GDA)

  • Linear discriminant function for class k:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\Pr(C_k)$$

  • Classify to the class with the largest value for its δk(x)
  • Parameter estimation (where $g_i$ is the class of $\mathbf{x}_i$): objective function

$$\hat{\beta} = \arg\max_{\beta}\sum_{i=1}^{N}\log\Pr(x_i, y_i \mid \beta) = \arg\max_{\beta}\sum_{i=1}^{N}\log\left[\Pr(x_i \mid y_i; \beta)\,\Pr(y_i)\right]$$

  • MLE estimated parameters:

$$\Pr(\hat{C}_k) = \frac{N_k}{N}, \qquad \hat{\mu}_k = \frac{1}{N_k}\sum_{g_i = k} x_i, \qquad \hat{\Sigma} = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$

SLIDE 25

Logistic Regression

  • To compute the posterior, we modeled the right side of the equation below by assuming the class densities were Gaussian and computed their parameters (or used a kernel estimate of the density):

$$\hat{P}(C_\lambda \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_\lambda)\,\hat{P}(C_\lambda)}{\hat{p}(\mathbf{x})} = \frac{\hat{p}(\mathbf{x} \mid C_\lambda)\,\hat{P}(C_\lambda)}{\sum_{l=1}^{L}\hat{p}(\mathbf{x} \mid C_l)\,\hat{P}(C_l)}$$

  • In logistic regression, we want to directly model the posterior as a function of the variable x:

$$\hat{P}(C \mid \mathbf{x}) = g(\mathbf{x})$$

  • In practice, when there are k classes to classify, we model:

$$P(C_1 \mid \mathbf{x}) = g_1(\mathbf{x}), \;\ldots,\; P(C_k \mid \mathbf{x}) = g_k(\mathbf{x})$$

SLIDE 26

Classification by maximizing the posterior distribution

In this example we assume that the two distributions for the classes have equal variance. Suppose we want to classify a person as male or female based on height. Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let y be an indicator variable for being female. Then the conditional distribution of x (the height) becomes:

$$p(x \mid y=1) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu_f)^2}{2\sigma^2}\right), \qquad p(x \mid y=0) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu_m)^2}{2\sigma^2}\right)$$

What we have: $p(x \mid y=0)$, $p(x \mid y=1)$, and $P(y=1) = q$.

What we want: $P(y=1 \mid x)$.

SLIDE 27

Posterior probability for classification when we have two classes, with $q = \Pr(y=1)$:

$$P(y=1 \mid x) = \frac{P(y=1)\,p(x \mid y=1)}{P(y=1)\,p(x \mid y=1) + P(y=0)\,p(x \mid y=0)}$$

$$= \frac{q\exp\left(-\frac{(x-\mu_f)^2}{2\sigma^2}\right)}{q\exp\left(-\frac{(x-\mu_f)^2}{2\sigma^2}\right) + (1-q)\exp\left(-\frac{(x-\mu_m)^2}{2\sigma^2}\right)}$$

$$= \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{\mu_m-\mu_f}{\sigma^2}\,x - \frac{\mu_m^2-\mu_f^2}{2\sigma^2}\right)}$$

SLIDE 28

Computing the posterior

Computing the probability that the subject is female, given that we observed height x:

$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{\mu_m-\mu_f}{\sigma^2}\,x - \frac{\mu_m^2-\mu_f^2}{2\sigma^2}\right)}$$

with $\mu_m = 176\,\mathrm{cm}$, $\mu_f = 166\,\mathrm{cm}$, $\sigma = 12\,\mathrm{cm}$, $P(y=1) = 0.5$.

[Figure: the class-conditional densities $p(x \mid y=1)$ and $p(x \mid y=0)$, and the posterior $P(y=1 \mid x)$ over heights 120–220 cm: a logistic function.]

In the denominator, x appears linearly inside the exponential. So if we assume that the class-membership densities p(x|y) are normal with equal variance, then the posterior probability will be a logistic function.
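A quick numpy sketch that reproduces this posterior with the slide's numbers (the function name is ours):

```python
import numpy as np

mu_m, mu_f, sigma, q = 176.0, 166.0, 12.0, 0.5   # means, shared std dev, prior P(y=1)

def posterior_female(x):
    """P(y=1 | x) for two equal-variance Gaussian classes: a logistic in x."""
    a = (np.log((1 - q) / q)
         + (mu_m - mu_f) / sigma**2 * x
         - (mu_m**2 - mu_f**2) / (2 * sigma**2))
    return 1.0 / (1.0 + np.exp(a))

print(posterior_female(np.array([140.0, 171.0, 200.0])))  # ~ [0.90, 0.50, 0.12]
```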

SLIDE 29

Posterior probability for classification when we have two classes, with $q = \Pr(y=1)$: the same derivation, with the exponent now written as a linear function of x,

$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{\mu_m-\mu_f}{\sigma^2}\,x - \frac{\mu_m^2-\mu_f^2}{2\sigma^2}\right)} = \frac{1}{1 + \exp\left(-(a_0 + \mathbf{a}^T\mathbf{x})\right)}$$

SLIDE 30

Logistic regression classification

[Figure: two classes in the $(x_1, x_2)$ plane ("Class 0" on one side) separated by the line $a_0 + \mathbf{a}^T\mathbf{x} = 0$.]

$$P(y=1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-a_0 - \mathbf{a}^T\mathbf{x})}, \qquad P(y=0 \mid \mathbf{x}) = 1 - P(y=1 \mid \mathbf{x}) = \frac{\exp(-a_0 - \mathbf{a}^T\mathbf{x})}{1 + \exp(-a_0 - \mathbf{a}^T\mathbf{x})}$$

Classify as $y=1$ if

$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})} = a_0 + \mathbf{a}^T\mathbf{x} > 0$$

  • The assumption of equal variance among the class densities implies a linear decision boundary.

SLIDE 31

Logistic regression: problem statement

Given training data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$ with $y^{(i)} \in \{0,1\}$, model

$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x}^{(i)})}, \qquad p(y^{(i)} \mid \mathbf{x}^{(i)}) = \left(q^{(i)}\right)^{y^{(i)}}\left(1 - q^{(i)}\right)^{1-y^{(i)}}$$

so the log-likelihood of the training set is

$$l(D) = \log\prod_{i=1}^{N}\left(q^{(i)}\right)^{y^{(i)}}\left(1-q^{(i)}\right)^{1-y^{(i)}} = \sum_{i=1}^{N}\left[y^{(i)}\log q^{(i)} + (1-y^{(i)})\log(1-q^{(i)})\right]$$

(Assumption of equal variance among the clusters.) The goal is to find the parameters $\mathbf{w}$ that maximize the log-likelihood of training, using augmented vectors

$$\mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

SLIDE 32

Some useful properties of the logistic function

$$q = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}, \quad 0 < q < 1, \qquad 1 - q = \frac{\exp(-\mathbf{w}^T\mathbf{x})}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

$$\log\frac{q}{1-q} = \mathbf{w}^T\mathbf{x}, \qquad \frac{dq}{d(\mathbf{w}^T\mathbf{x})} = q(1-q), \qquad \frac{d}{dq}\log\frac{q}{1-q} = \frac{1}{q(1-q)}$$

SLIDE 33

Online algorithm for logistic regression: gradient ascent (new data or iterate)

Differentiating $l(D) = \sum_i y^{(i)}\log q^{(i)} + (1-y^{(i)})\log(1-q^{(i)})$, using $\frac{dq^{(i)}}{d\mathbf{w}} = q^{(i)}(1-q^{(i)})\,\mathbf{x}^{(i)}$:

$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{N}\left[\frac{y^{(i)}}{q^{(i)}} - \frac{1-y^{(i)}}{1-q^{(i)}}\right]q^{(i)}(1-q^{(i)})\,\mathbf{x}^{(i)} = \sum_{i=1}^{N}\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)}$$

so the online (per-example) update is

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)}$$
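A minimal sketch of this online update (the learning rate, epoch count, and shuffling scheme are our choices, not the slides'):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_sgd(X, y, eta=0.1, epochs=100, seed=0):
    """Online gradient ascent on the logistic log-likelihood."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented x = (1, x1, ..., xp)
    w = np.zeros(Xb.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):           # visit examples in random order
            q = sigmoid(w @ Xb[i])
            w += eta * (y[i] - q) * Xb[i]           # w <- w + eta * (y - q) * x
    return w
```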

SLIDE 34

Batch algorithm: Iteratively Re-weighted Least Squares

  • First derivative: with $\mathbf{q} \equiv (q^{(1)}, \ldots, q^{(N)})^T$, $\mathbf{y} \equiv (y^{(1)}, \ldots, y^{(N)})^T$, and $X$ the matrix whose rows are the $\mathbf{x}^{(i)T}$:

$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{N}\left(y^{(i)} - q^{(i)}\right)\mathbf{x}^{(i)} = X^T(\mathbf{y} - \mathbf{q})$$

  • Second derivative:

$$\frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = -\sum_{i=1}^{N} q^{(i)}\left(1-q^{(i)}\right)\mathbf{x}^{(i)}\mathbf{x}^{(i)T} = -X^T Q X, \qquad Q \equiv \mathrm{diag}\left(q^{(i)}(1-q^{(i)})\right)$$
SLIDE 35

IRLS (Iteratively Re-weighted Least Squares)

With $P(y=1 \mid \mathbf{x}) = q = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x})}$, Newton’s method on the log-likelihood gives

$$\frac{dl}{d\mathbf{w}} = X^T(\mathbf{y} - \mathbf{q}), \qquad \frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = -X^T Q X$$

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + (X^T Q X)^{-1} X^T(\mathbf{y} - \mathbf{q})$$

[Figure: sensitivity to error, $\frac{1}{q(1-q)}$ plotted against $q$: smallest where the model is uncertain ($q \approx 0.5$), large where it is certain ($q$ near 0 or 1).]
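An IRLS sketch matching the update above (the convergence test and iteration cap are our additions):

```python
import numpy as np

def logreg_irls(X, y, max_iter=25, tol=1e-8):
    """Newton / IRLS updates: w <- w + (X^T Q X)^{-1} X^T (y - q)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        q = 1.0 / (1.0 + np.exp(-Xb @ w))
        Q = q * (1 - q)                              # diagonal of Q, kept as a vector
        step = np.linalg.solve(Xb.T @ (Q[:, None] * Xb), Xb.T @ (y - q))
        w += step
        if np.linalg.norm(step) < tol:
            break
    return w
```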

SLIDE 36

Modeling the posterior with unequal variance

The same posterior computation, but now with class variances $\sigma_1 \ne \sigma_2$:

$$P(y=1 \mid x) = \frac{q\,\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right)}{q\,\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + (1-q)\,\frac{1}{\sqrt{2\pi}\sigma_2}\exp\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right)} = \frac{1}{1 + \exp\left(w_0 + w_1 x + w_2 x^2\right)}$$

The $x^2$ term appears because the quadratic terms in the two exponents no longer cancel.

SLIDE 37

Logistic regression with basis functions

Given $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$ with $y^{(i)} \in \{0,1\}$, model the posterior through a vector of basis functions $\mathbf{g}(\mathbf{x})$ (e.g. including quadratic terms such as $x_1^2$ and $x_2^2$):

$$P(y=1 \mid \mathbf{x}) = q = \frac{1}{1 + \exp\left(-\mathbf{w}^T\mathbf{g}(\mathbf{x})\right)}, \qquad X \equiv \begin{bmatrix}\mathbf{g}(\mathbf{x}^{(1)})^T \\ \vdots \\ \mathbf{g}(\mathbf{x}^{(N)})^T\end{bmatrix}, \qquad \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + (X^T Q X)^{-1}X^T(\mathbf{y} - \mathbf{q})$$

[Figure: estimated posterior probability over the $(x_1, x_2)$ plane.] By using non-linear bases, we can deal with clusters having unequal variance.
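A sketch of one such basis expansion, reusing the logreg_irls routine above on quadratically expanded 2-D inputs (the particular basis is our choice):

```python
import numpy as np

def quadratic_basis(X):
    """g(x) = (x1, x2, x1^2, x2^2, x1*x2); logreg_irls prepends the constant term."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# w = logreg_irls(quadratic_basis(X), y)   # fit exactly as before, in the lifted space
```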

SLIDE 38

Logistic function for multiple classes with equal variance

Rather than modeling the posterior directly, let us pick the posterior for one class as our reference and then model the ratio of the posterior for all other classes with respect to that class. Suppose we have k classes; with equal-covariance Gaussian class densities,

$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = \log\frac{q_1\,p(\mathbf{x} \mid y=1)}{q_k\,p(\mathbf{x} \mid y=k)} = \log\frac{q_1}{q_k} - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)$$

The quadratic terms cancel, leaving a linear function of x:

$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = a_1 + \mathbf{w}_1^T\mathbf{x}$$

SLIDE 39

Logistic function for multiple classes with equal variance: soft-max

Define $m_i \equiv \log\frac{P(y=i \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = \mathbf{w}_i^T\mathbf{x}$, so that $P(y=i \mid \mathbf{x}) = e^{m_i}\,P(y=k \mid \mathbf{x})$. Since the posteriors sum to one,

$$1 = P(y=k \mid \mathbf{x})\left(1 + \sum_{i=1}^{k-1} e^{m_i}\right) \;\Rightarrow\; P(y=k \mid \mathbf{x}) = \frac{1}{1 + \sum_{j=1}^{k-1} e^{m_j}}, \qquad P(y=i \mid \mathbf{x}) = \frac{e^{m_i}}{1 + \sum_{j=1}^{k-1} e^{m_j}}$$

A “soft-max” function.
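A tiny numpy sketch of this reference-class soft-max (class k is the reference with log-ratio fixed at 0; the layout is our choice):

```python
import numpy as np

def softmax_posteriors(W, x):
    """Posteriors for k classes from the k-1 linear log-ratios m_i = w_i^T x.

    W : (k-1, d) rows of weights; the k-th (reference) class has implicit score 0.
    """
    m = W @ x                                   # log P(y=i|x) / P(y=k|x), i = 1..k-1
    scores = np.concatenate([m, [0.0]])         # append the reference class
    e = np.exp(scores - scores.max())           # subtract the max for numerical stability
    return e / e.sum()                          # sums to 1 over all k classes

W = np.array([[0.5, -1.0], [2.0, 0.3]])        # k = 3 classes, d = 2 features
print(softmax_posteriors(W, np.array([1.0, 2.0])))
```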

SLIDE 40

Classification of multiple classes with equal variance

[Figure: three weighted class densities $p(x \mid y=i)\,P(y=i)$, $i = 1, 2, 3$, over heights 160–220; their mixture $p(x) = \sum_{i=1}^{3} p(x \mid y=i)\,P(y=i)$; the resulting posterior probabilities; and the log-ratios $\log\frac{P(y=1 \mid x)}{P(y=3 \mid x)}$ and $\log\frac{P(y=2 \mid x)}{P(y=3 \mid x)}$, which are linear in x.]

SLIDE 41

SLIDE 42

Fisher’s linear discriminant

  • A simple linear discriminant function is a projection of the data down to 1-D.
  • So choose the projection that gives the best separation of the classes. What do we mean by “best separation”?
  • An obvious direction to choose is the direction of the line joining the class means.
  • But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation (see the next figure).
  • Fisher’s method chooses the direction that maximizes the ratio of between-class variance to within-class variance.
  • This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

SLIDE 43

A picture showing the advantage of Fisher’s linear discriminant.

When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

SLIDE 44

(Fisher) Discriminant Analysis

  • Discriminant analysis seeks directions that are efficient for discrimination.
  • Consider the problem of projecting data from d dimensions onto a line, with the hope that we can optimize the orientation of the line to minimize error.
  • Consider a set of N d-dimensional samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, with $n_1$ in the subset $D_1$ labeled $\omega_1$ and $n_2$ in the subset $D_2$ labeled $\omega_2$.
  • Define a linear combination of the components of x: $y = \mathbf{w}^T\mathbf{x}$, which yields a corresponding set of N samples $y_1, y_2, \ldots, y_N$ divided into $Y_1$ and $Y_2$.
  • Our challenge is to find w that “maximizes separation”.
  • This can be done by considering the ratio of the between-class scatter to the within-class scatter.

SLIDE 45

Separation of the Means and Scatter

  • Define a sample mean for class i:

$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{x}$$

  • The sample mean for the projected points is:

$$\tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i$$

  • The sample mean for the projected points is just the projection of the mean (which is expected since this is a linear transformation).
  • It follows that the distance between the projected means is:

$$|\tilde{m}_1 - \tilde{m}_2| = |\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)|$$

SLIDE 46

Separation of the Means and Scatter

  • Define a scatter for the projected samples:

$$\tilde{s}_i^2 = \sum_{y \in Y_i}(y - \tilde{m}_i)^2$$

  • An estimate of the variance of the pooled data is $\frac{1}{n}\left(\tilde{s}_1^2 + \tilde{s}_2^2\right)$, and $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the within-class scatter.

SLIDE 47

Fisher Linear Discriminant and Scatter

  • The Fisher linear discriminant maximizes the criterion:

$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

  • Define a scatter matrix for class i:

$$\mathbf{S}_i = \sum_{\mathbf{x} \in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$

  • The total within-class scatter is: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
  • We can write the scatter for the projected samples as:

$$\tilde{s}_i^2 = \sum_{\mathbf{x} \in D_i}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2 = \sum_{\mathbf{x} \in D_i}\mathbf{w}^T(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_i\mathbf{w}$$

  • So the sum of the scatters can be written as:

$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}$$

SLIDE 48

Separation of the Projected Means

  • The separation of the projected means obeys:

$$|\tilde{m}_1 - \tilde{m}_2|^2 = \left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_B\mathbf{w}$$

  • where the between-class scatter, $\mathbf{S}_B$, is given by: $\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$
  • $\mathbf{S}_W$ is the within-class scatter and is proportional to the covariance of the pooled data.
  • $\mathbf{S}_B$, the between-class scatter, is symmetric and positive semi-definite, but because it is the outer product of two vectors, its rank is at most one.
  • This implies that for any w, $\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$.
  • The criterion function, J(w), can be written as:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

SLIDE 49

Linear Discriminant Analysis

  • This ratio is well known as the generalized Rayleigh quotient and has the well-known property that the vector, w, that maximizes J(w), must satisfy:

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$$

SLIDE 50

Proof of Fisher

  • Show that J(w) is maximized when $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$. Write

$$J(\mathbf{w}) = \frac{f}{g} = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

  • Setting $\nabla J = (g\,\nabla f - f\,\nabla g)/g^2 = 0$ gives $\mathbf{S}_B\mathbf{w} = \frac{f}{g}\,\mathbf{S}_W\mathbf{w}$, i.e.

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$$

SLIDE 51

Linear Discriminant Analysis

  • This ratio is well known as the generalized Rayleigh quotient and has the well-known property that the vector, w, that maximizes J(), must satisfy:

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$$

  • Recall: $\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, so the solution is

$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

this is Fisher’s linear discriminant.
  • This solution maps the d-dimensional problem to a one-dimensional problem (in this case).
  • From earlier: when the conditional densities $p(\mathbf{x} \mid \omega_i)$ are multivariate Gaussian with equal covariances, the optimal decision boundary is given by $\mathbf{w}^T\mathbf{x} + w_0 = 0$, where $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0$ is related to the prior probabilities.
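A short numpy sketch of the two-class Fisher direction $\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, with our own names and a simple midpoint threshold:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant for two classes given as row-sample matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    return np.linalg.solve(S_W, m1 - m2)                      # w = S_W^{-1}(m1 - m2)

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(200, 2))
X2 = rng.normal([3.0, 1.0], [1.0, 3.0], size=(200, 2))
w = fisher_direction(X1, X2)
t = 0.5 * (X1.mean(0) + X2.mean(0)) @ w          # threshold at the projected midpoint
print(np.mean(X1 @ w > t), np.mean(X2 @ w < t))  # fraction of each class on its side
```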

SLIDE 52

Suppose K>2?

SLIDE 53

Multiple Discriminant Analysis

  • For the c-class problem in a d-dimensional space, the natural generalization involves c−1 discriminant functions.
  • The within-class scatter is defined as:

$$\mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i, \qquad \mathbf{S}_i = \sum_{\mathbf{x} \in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$

  • Define a total mean vector, m:

$$\mathbf{m} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathbf{m}_i$$

  • and a total scatter matrix, $\mathbf{S}_T$, by:

$$\mathbf{S}_T = \sum_{\mathbf{x}}(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T$$

  • The total scatter is related to the within-class scatter by $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$, where

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$

  • We have c−1 discriminant functions of the form:

$$y_i = \mathbf{w}_i^T\mathbf{x}, \quad i = 1, 2, \ldots, c-1, \qquad \mathbf{y} = \mathbf{W}^T\mathbf{x}$$

SLIDE 54

Multiple Discriminant Analysis (cont.)

  • The criterion function is:

$$J(\mathbf{W}) = \frac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|}$$

where $\mathbf{S}_W$ is as before (pooled covariance) but $\mathbf{S}_B$ is now the covariance of the K centers, of rank K−1:

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$

  • The solution to maximizing J(W) is once again found via an eigenvalue decomposition:

$$(\mathbf{S}_B - \lambda_i\,\mathbf{S}_W)\,\mathbf{w}_i = 0$$

  • Because $\mathbf{S}_B$ is the sum of c rank-one-or-less matrices, and because only c−1 of these are independent, $\mathbf{S}_B$ is of rank c−1 or less. (See Hastie chapter 4.)
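A sketch of the multi-class solution, solving $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$ as an ordinary eigenproblem on $\mathbf{S}_W^{-1}\mathbf{S}_B$ (our simplification; it assumes $\mathbf{S}_W$ is invertible):

```python
import numpy as np

def multi_fisher_directions(X, labels, c):
    """Top c-1 discriminant directions from the eigenvectors of S_W^{-1} S_B."""
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for i in range(c):
        Xi = X[labels == i]
        mi = Xi.mean(axis=0)
        S_W += (Xi - mi).T @ (Xi - mi)                        # within-class scatter
        S_B += len(Xi) * np.outer(mi - m, mi - m)             # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # S_W^{-1} S_B w = lambda w
    order = np.argsort(evals.real)[::-1]                      # keep largest eigenvalues
    return evecs[:, order[:c - 1]].real                       # d x (c-1) projection W
```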

SLIDE 55

Spreading out the centers

SLIDE 56

Multi-Fisher

  • When well behaved, multi-class Fisher (FLDA?) can work well.
  • Maybe you’ll try it???
SLIDE 57

Some leftover time?

SLIDE 58

Discriminant functions for N>2 classes

  • One possibility is to use N two-way discriminant functions.
  • Each function discriminates one class from the rest.
  • Another possibility is to use N(N-1)/2 two-way discriminant functions.
  • Each function discriminates between two particular classes.
  • Both these methods have problems.
SLIDE 59

Problems with multi-class discriminant functions

More than one good answer. Two-way preferences need not be transitive!

SLIDE 60

A simple solution

  • Use N discriminant functions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x}), \ldots$ and pick the max.
  • This is guaranteed to give consistent and convex decision regions if each $f_k(\mathbf{x})$ is linear:

$$f_k(\mathbf{x}_A) > f_j(\mathbf{x}_A) \text{ and } f_k(\mathbf{x}_B) > f_j(\mathbf{x}_B) \text{ imply, for positive } \alpha\text{, that } f_k\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big) > f_j\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big)$$

SLIDE 61

More time?

SLIDE 62

A way of thinking about the role of the inverse covariance matrix

  • If the Gaussian is spherical we don’t need to worry about the covariance matrix.
  • So we could start by transforming the data space to make the Gaussian spherical.
  • This is called “whitening” the data.
  • It pre-multiplies by the matrix square root of the inverse covariance matrix.
  • In the transformed space, the weight vector is just the difference between the transformed means:

$$\mathbf{w}_{\mathrm{aff}} = \Sigma^{-\frac{1}{2}}\boldsymbol{\mu}_1 - \Sigma^{-\frac{1}{2}}\boldsymbol{\mu}_2 \text{ for } \mathbf{x}_{\mathrm{aff}} = \Sigma^{-\frac{1}{2}}\mathbf{x} \text{ gives the same value for } \mathbf{w}_{\mathrm{aff}}^T\mathbf{x}_{\mathrm{aff}} \text{ as } \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \text{ gives for } \mathbf{w}^T\mathbf{x}$$
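A numpy sketch verifying the whitening identity above (matrix square root via eigendecomposition; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)                 # a positive-definite covariance
mu1, mu2, x = rng.normal(size=(3, 3))           # two means and a data point

evals, V = np.linalg.eigh(Sigma)                # Sigma is symmetric positive definite
Sigma_inv_sqrt = V @ np.diag(evals ** -0.5) @ V.T

w_aff = Sigma_inv_sqrt @ (mu1 - mu2)            # weight vector in the whitened space
x_aff = Sigma_inv_sqrt @ x                      # whitened data point
w = np.linalg.solve(Sigma, mu1 - mu2)           # w = Sigma^{-1}(mu1 - mu2)
assert np.isclose(w_aff @ x_aff, w @ x)         # same discriminant value either way
```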

SLIDE 63

Two ways to train a set of class-specific generative models

  • Generative approach: train each model separately to fit the input vectors of that class.
  • Different models can be trained on different cores.
  • It is easy to add a new class without retraining all the other classes.
  • These are significant advantages when the models are harder to train than the simple linear models considered here.
  • Discriminative approach: train all of the parameters of both models to maximize the probability of getting the labels right.

SLIDE 64

An example where the two types of training behave very differently

[Figure: a decision boundary between two Gaussian classes. What happens to the decision boundary if we add a new red point far to the right, refitting the red Gaussian?]

For generative fitting, the red mean moves rightwards but the decision boundary moves leftwards! If you really believe it’s Gaussian data, this is sensible.