Lecture 12: Midterm Exam Review


SLIDE 1

Lecture 12: Midterm Exam Review

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2

Pattern recognition design cycle

SLIDE 3

Pattern recognition design cycle

  • Collecting training and testing data.
  • How can we know when we have an adequately large and representative set of samples?
SLIDE 4

Training/Test Split

  • Randomly split dataset into two parts:
  • Training data
  • Test data
  • Use training data to optimize parameters
  • Evaluate error using test data
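
As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of such a random split; the synthetic data and the 80/20 ratio are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 100 samples, 2 features, binary labels (assumed for illustration).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Randomly permute the indices, then split 80% / 20% (an assumed ratio).
perm = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]   # used to optimize parameters
X_test, y_test = X[test_idx], y[test_idx]       # used to evaluate the error

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```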
SLIDE 5

Training/Test Split

  • How many points in each set?
  • A very hard question.
  • Too few points in the training set: the learned classifier is bad.
  • Too few points in the test set: the classifier evaluation is insufficient.
  • Cross-validation
  • Leave-one-out cross-validation
SLIDE 6

Cross-Validation

  • In practice
  • Available data => training and validation
  • Train on the training data
  • Test on the validation data
  • k-fold cross validation:
  • Data randomly separated into k groups
  • Each time, k − 1 groups are used for training and one as testing (a minimal index-split sketch follows below)
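
A minimal index-level sketch of this k-fold split, assuming k = 5 and a placeholder `train_and_evaluate` function standing in for whatever classifier is being validated:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Randomly separate sample indices into k groups (folds)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return np.array_split(perm, k)

def cross_validate(X, y, k, train_and_evaluate):
    """Each time, k-1 folds are used for training and one for validation."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
    return np.mean(scores)

# Toy usage with a trivial "classifier" that predicts the majority training label.
X = np.random.default_rng(1).normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)
majority = lambda Xtr, ytr, Xva, yva: np.mean(yva == np.bincount(ytr).argmax())
print(cross_validate(X, y, k=5, train_and_evaluate=majority))
```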
SLIDE 7

Cross Validation and Test Accuracy

  • Using CV on training + validation
  • Classify test data with the best parameters from CV
SLIDE 8

Pattern recognition design cycle

  • Domain dependence and prior information.
  • Computational cost and feasibility.
  • Discriminative features, i.e., similar values for similar patterns, and different values for different patterns.
  • Invariant features with respect to translation, rotation and scale.
  • Robust features with respect to occlusion, distortion, deformation, and variations in environment.
SLIDE 9

PCA: Visualization

Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.

SLIDE 10

Computation of PCA

  • In practice we compute PCA via SVD (singular value decomposition).
  • Form the centered data matrix:

X_{p,N} = [ (x_1 − m), …, (x_N − m) ]

  • Compute its SVD:

X_{p,N} = U_{p,p} D_{p,N} V^T_{N,N}

U and V are orthogonal matrices, D is a diagonal matrix.

SLIDE 11

Computation of PCA…

  • Sometimes we are given only a few high-dimensional data points, i.e., p ≥ N.
  • In such cases compute the SVD of X^T:

X^T_{N,p} = V_{N,N} D_{N,N} U^T_{N,p}

So we get:

X_{p,N} = U_{p,N} D_{N,N} V^T_{N,N}

  • Then, proceed as before: choose only d < N significant eigenvalues for the data representation,

x̃_i = U_{p,d} U^T_{p,d} (x_i − m) + m

  • Usually we use the reduced-dimension features to fit the classification models.
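
A small NumPy sketch of the SVD-based computation above, following the slides' convention that samples are the columns of X; the synthetic data and the choice d = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# N samples of dimension p, stored as columns (p x N), as in the slides.
p, N, d = 5, 20, 2
data = rng.normal(size=(p, N))

m = data.mean(axis=1, keepdims=True)   # mean vector m
X = data - m                           # centered data matrix X_{p,N}

# SVD of the centered data: X = U D V^T (economy form).
U, D, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the d leading eigenvectors (columns of U) as the PCA basis.
Ud = U[:, :d]

# Reduced-dimension features, usually the input to the classifier:
Z = Ud.T @ X                           # d x N

# Reconstruction x~_i = U_d U_d^T (x_i - m) + m
X_rec = Ud @ Z + m
print(np.linalg.norm(data - X_rec))    # reconstruction error with d components
```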

SLIDE 12

Fisher Linear Discriminant

  • We need to normalize by both the scatter of class 1 and the scatter of class 2.
  • The Fisher linear discriminant is the projection onto a line in the direction v which maximizes J(v), the ratio of between-class to within-class scatter.

SLIDE 13

Fisher Linear Discriminant

  • Thus our objective function can be written:
  • Maximize J(v) by taking the derivative w.r.t. v and setting it to 0
SLIDE 14

Fisher Linear Discriminant

SLIDE 15

Fisher Linear Discriminant

  • If S_W has full rank (the inverse exists), we can convert this to a standard eigenvalue problem.
  • But S_B x, for any vector x, points in the same direction as μ1 − μ2.
  • Based on this, we can solve the eigenvalue problem directly.

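For the two-class case this eigenvalue problem has the well-known closed-form direction v ∝ S_W^{-1}(μ1 − μ2); the sketch below computes it with NumPy on synthetic data (the data and class parameters are assumptions).

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction v = S_W^{-1} (mu1 - mu2).

    X1, X2: arrays of shape (n_i, d) holding the samples of class 1 and class 2.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W: sum of the two class scatter matrices.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    return np.linalg.solve(Sw, mu1 - mu2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=[1.0, 3.0], size=(50, 2))
X2 = rng.normal(loc=[3, 3], scale=[1.0, 3.0], size=(50, 2))
v = fisher_direction(X1, X2)
print(v / np.linalg.norm(v))   # projection direction maximizing J(v)
```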

SLIDE 16

Example

SLIDE 17

Pattern recognition design cycle

How can we know how close we are to the true model underlying the patterns?

  • Domain dependence and prior information.
  • Definition of design criteria.
  • Parametric vs. non-parametric models.
  • Handling of missing features.
  • Computational complexity.
  • Types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid.

SLIDE 18

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 19

Decision Rule

  • Using Bayes' rule:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)   (posterior = likelihood × prior / evidence)

where the evidence is

p(x) = Σ_{j=1..2} p(x | ωj) P(ωj)

Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

or

Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2

or

Decide ω1 if p(x | ω1) / p(x | ω2) > P(ω2) / P(ω1); otherwise decide ω2
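
A minimal sketch of this decision rule for two classes with assumed one-dimensional Gaussian class-conditional densities and priors (all numbers below are illustrative assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed 1-D class-conditional densities and priors (for illustration only).
P1, P2 = 0.6, 0.4
mu1, mu2, sigma = 0.0, 2.0, 1.0

def decide(x):
    """Decide omega1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide omega2."""
    g1 = gaussian_pdf(x, mu1, sigma) * P1
    g2 = gaussian_pdf(x, mu2, sigma) * P2
    return 1 if g1 > g2 else 2

for x in [-1.0, 0.5, 1.0, 1.5, 3.0]:
    print(x, "-> omega", decide(x))
```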

SLIDE 20

Discriminant Functions

  • A useful way to represent a classifier is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if gi(x) > gj(x) for all j ≠ i.

SLIDE 21

Discriminants for Bayes Classifier

  • Is the choice of gi unique?
  • Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results. Equivalent choices:

gi(x) = P(ωi | x) = p(x | ωi) P(ωi) / p(x)
gi(x) = p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)

we'll use this discriminant extensively!

SLIDE 22

Case 1: Statistically Independent Features with Identical Variances

SLIDE 23

Case II: Identical Covariances

  • Notes on the decision boundary:
  • As for Case I, it passes through a point x0 lying on the line between the two class means; again, x0 is in the middle if the priors are identical.
  • The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

SLIDE 24

Case III: Arbitrary Covariances

Nonlinear decision boundaries

SLIDE 25

Parameter Estimation

  • Maximum likelihood: the values of the parameters are fixed but unknown.
  • Bayesian estimation / Maximum a posteriori (MAP): the parameters are random variables having some known a priori distribution.

SLIDE 26

Maximum-Likelihood Estimation

  • Use a set D of independent samples to estimate the unknown parameter vector θ.
  • Our goal is to determine the estimate of θ that best agrees with the observed training data.
  • Note that if D is fixed, the likelihood p(D | θ) is a function of θ and is not a density.
SLIDE 27

Example: Gaussian case

  • Assume we have c classes, each with a Gaussian class-conditional density with unknown parameters.
  • Use the information provided by the training samples to estimate the parameters; each parameter vector is associated with one category.
  • Suppose that D contains n samples, x1, …, xn.
SLIDE 28

Maximum-Likelihood Estimation

  • p(D | θ) is called the likelihood of θ w.r.t. the set of samples.
  • The ML estimate of θ is, by definition, the value that maximizes p(D | θ).
  • "It is the value of θ that best agrees with the actually observed training samples."
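
For the Gaussian case of the previous slide, the ML estimates are the sample mean and the (1/n) sample covariance, a standard result; the sketch below checks this on synthetic data (the true parameters are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
D = rng.multivariate_normal(true_mu, true_cov, size=500)   # n samples from one class

# ML estimates for a Gaussian: sample mean and (biased, 1/n) sample covariance.
mu_hat = D.mean(axis=0)
diff = D - mu_hat
cov_hat = diff.T @ diff / len(D)

print(mu_hat)    # close to true_mu
print(cov_hat)   # close to true_cov
```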
SLIDE 29

Optimal Estimation

  • Let θ be the parameter vector and ∇_θ the gradient operator.
  • We define l(θ) = ln p(D | θ) as the log-likelihood function.
  • New problem statement: determine the θ that maximizes the log-likelihood.
SLIDE 30

Optimal Estimation

  • Local or global maximum
  • Local or global minimum
  • Saddle point
  • Boundary of parameter space
SLIDE 31

Bayesian Estimation (MAP): General Theory

  • The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized.
  • The basic assumptions are:
  – The form of p(x | θ) is assumed known, but the value of θ is not known exactly.
  – Our knowledge about θ is assumed to be contained in a known prior density p(θ).
  – The rest of our knowledge is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x).

SLIDE 32

Bayesian Estimation (MAP): General Theory

  • The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)".
  • Using Bayes formula, we have:

p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

  • And by the independence assumption:

p(D | θ) = Π_{k=1..n} p(x_k | θ)
SLIDE 33

MLE vs. MAP

  • Maximum Likelihood estimation (MLE)
  – Choose the parameter value that maximizes the probability of the observed data
  • Maximum a posteriori (MAP) estimation
  – Choose the parameter value that is most probable given the observed data and the prior belief

When is MAP same as MLE?
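
A small sketch contrasting the two estimates for the mean of a 1-D Gaussian with known variance and a Gaussian prior on the mean (the prior parameters and data are assumptions for illustration); note how MAP approaches MLE as the prior flattens or the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                       # known data variance (assumed)
mu0, tau2 = 0.0, 0.25              # Gaussian prior on the mean (assumed)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=10)

# MLE: the value that maximizes the likelihood -> the sample mean.
mu_mle = x.mean()

# MAP: posterior mode for a Gaussian likelihood with a Gaussian prior (closed form).
n = len(x)
mu_map = (n * tau2 * mu_mle + sigma2 * mu0) / (n * tau2 + sigma2)

print(mu_mle, mu_map)   # MAP is pulled toward the prior mean mu0
```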

SLIDE 34

Naïve Bayes Classifier (not BE)

  • A simple classifier that applies Bayes' rule with strong (naive) independence assumptions among the features.
  • A.k.a. the "independent feature model".
  • Often performs reasonably well despite its simplicity.
SLIDE 35

Naïve Bayes Classifier

  • NB is known to produce posteriors closer to the extremes (0 or 1) than the true posteriors – Why?
  • NB performs well when only small amounts of training data are available – Why?

SLIDE 36

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 37

Density Estimation: Two Approaches

  • Parzen windows:

Shrink an initial region by specifying its volume Vn as some function of n, and show that the resulting estimate converges to p(x). This is called "the Parzen-window estimation method".

  • k-Nearest Neighbors:

Specify kn as some function of n; the volume Vn is grown until it encloses kn neighbors of x. This is called "the kn-nearest-neighbor estimation method".

SLIDE 38

The k–Nearest-Neighbor Rule

  • Goal: classify x by assigning it the label most frequently represented among the k nearest samples.
  • Use a voting scheme.

The k-nearest-neighbor query starts at the test point and grows a spherical region until it encloses k training samples, and labels the test point by a majority vote of these samples.
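
A minimal kNN classifier implementing this voting scheme; Euclidean distance, k = 5, and the synthetic data are assumptions for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Label x by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all samples
    nearest = np.argsort(dists)[:k]               # indices of the k nearest samples
    votes = np.bincount(y_train[nearest])         # voting scheme
    return votes.argmax()

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
# Nonlinear (multi-modal-style) class structure that parametric models handle poorly.
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 < 1).astype(int)

print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))   # likely class 1
print(knn_predict(X_train, y_train, np.array([2.0, 2.0])))   # likely class 0
```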

SLIDE 39

kNN: Multi-modal Distributions

  • Most parametric distributions would not work for this 2-class classification problem.
  • Nearest neighbors will do reasonably well, provided we have a lot of samples.

SLIDE 40

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 41

Augmented feature space

  • Augmented feature/parameter space:

g(x) = w^T x + w0 = Σ_{i=1..d} w_i x_i + w0 = α^T y

w = [w1, …, wd]^T  ⇒  α = [w0, w1, …, wd]^T
x = [x1, …, xd]^T  ⇒  y = [1, x1, …, xd]^T

Discriminant:

g(x) = α^T y

If α^T y_i ≥ 0 assign y_i to ω1, else if α^T y_i < 0 assign y_i to ω2

SLIDE 42

Normalization

Seek a hyperplane that separates patterns from different categories; equivalently, after "normalization" (negating the samples of ω2), seek a hyperplane that puts all normalized patterns on the same (positive) side.

Classification rule: If α^T y_i > 0 assign y_i to ω1, else if α^T y_i < 0 assign y_i to ω2

SLIDE 43

Perceptron Batch Rule

  • The perceptron criterion and its gradient are:

J_p(α) = Σ_{y ∈ Y(α)} (−α^T y)

∇J_p(α) = Σ_{y ∈ Y(α)} (−y)

where Y(α) is the set of samples misclassified by α.

  • The perceptron update rule is obtained using gradient descent:

α(k+1) = α(k) + η(k) Σ_{y ∈ Y(α)} y

  • It is not possible to solve ∇J_p(α) = 0 analytically.
  • It is called the batch rule because it is based on all misclassified examples.
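
A sketch of this batch rule operating on normalized augmented samples (the ω2 samples negated, as on the normalization slide); the fixed learning rate and the synthetic data are assumptions for the example.

```python
import numpy as np

def perceptron_batch(Y, eta=0.1, max_iter=1000):
    """Batch perceptron rule on normalized augmented samples Y (one row per sample).

    Each iteration adds eta * (sum of the misclassified samples) to alpha.
    """
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ alpha <= 0]          # samples with alpha^T y <= 0
        if len(misclassified) == 0:
            break                                  # separating hyperplane found
        alpha = alpha + eta * misclassified.sum(axis=0)
    return alpha

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2, 2], size=(20, 2))              # class omega1
X2 = rng.normal(loc=[-2, -2], size=(20, 2))            # class omega2
aug = lambda X: np.hstack([np.ones((len(X), 1)), X])   # augmented vectors y = [1, x]
Y = np.vstack([aug(X1), -aug(X2)])                     # negate the omega2 samples

alpha = perceptron_batch(Y)
print(alpha, np.all(Y @ alpha > 0))                    # True if a separator was found
```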

SLIDE 44

Perceptron Single Sample Rule

  • The gradient descent single-sample rule for J_p(α) is:
  – Note that y_M is one sample misclassified by α(k)
  – Must have a consistent way of visiting the samples
  • Geometric interpretation:
  – y_M is on the wrong side of the decision hyperplane
  – Adding ηy_M to α moves the new decision hyperplane in the right direction with respect to y_M

SLIDE 45

MSE Criterion Function

  • Minimum squared error approach: find an α which minimizes the length of the error vector e.
  • Thus, minimize the minimum squared error criterion function.
  • Unlike the perceptron criterion function, we can optimize the minimum squared error criterion function analytically by setting its gradient to 0.

SLIDE 46

Computing the Gradient

SLIDE 47

Pseudo-Inverse Solution

  • Setting the gradient to 0:
  • The matrix Y^T Y is square (it has d + 1 rows and columns) and it is often non-singular.
  • If Y^T Y is non-singular, its inverse exists and we can solve for α uniquely: α = (Y^T Y)^{-1} Y^T b.
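
A sketch of this pseudo-inverse solution; taking b as a vector of ones is an assumed (common) choice, and np.linalg.pinv carries out the (Y^T Y)^{-1} Y^T computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2, 2], size=(20, 2))
X2 = rng.normal(loc=[-2, -2], size=(20, 2))
aug = lambda X: np.hstack([np.ones((len(X), 1)), X])
Y = np.vstack([aug(X1), -aug(X2)])    # normalized augmented samples (omega2 negated)

b = np.ones(len(Y))                   # margin vector of all ones (an assumed choice)

# MSE solution a = (Y^T Y)^{-1} Y^T b, computed via the pseudo-inverse of Y.
a = np.linalg.pinv(Y) @ b

print(a, np.mean(Y @ a > 0))          # fraction of samples on the positive side
```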

SLIDE 48

Ho-Kashyap Procedure

  • As usual, take partial derivatives w.r.t. a and b
  • Use modified gradient descent procedure to find a

minimum of JHK(a,b)

  • Alternate the two steps below until convergence:

Fix b and minimize JHK(a,b) with respect to a

Fix a and minimize JHK(a,b) with respect to b

SLIDE 49

LDF Summary

  • Perceptron procedures
  – Find a separating hyperplane in the linearly separable case
  – Do not converge in the non-separable case
  – Can force convergence by using a decreasing learning rate, but are not guaranteed a reasonable stopping point
  • MSE procedures
  – Converge in the separable and non-separable cases
  – May not find a separating hyperplane even if the classes are linearly separable
  – Use the pseudoinverse if Y^T Y is not singular and not too large
  – Use gradient descent (Widrow-Hoff procedure) otherwise
  • Ho-Kashyap procedures
  – Always converge
  – Find a separating hyperplane in the linearly separable case
  – More costly

SLIDE 50

Linear Support Vector Machine (SVM)

  • Sec. 15.1
  • Separating hyperplane: w^T x + b = 0, with margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = −1
  • Maximize the margin

ρ = ||x_a − x_b|| = 2/||w||

  • Primal problem
SLIDE 51

SVM solution: Lagrange multipliers

  • We can then swap ’max’ and ’min’:
  • We can find the optimal w as a function of {αi} by

setting the derivatives to zero:

SLIDE 52

SVM: Optimal Hyperplane

  • Use the Karush-Kuhn-Tucker (KKT) conditions to convert our problem to:
  • α = {α1, …, αn} are new variables, one for each sample
  • Optimized by quadratic programming
SLIDE 53

SVM: Optimal Hyperplane

  • After finding the optimal α = {α1, …, αn}
  • Final discriminant function:
  • where S is the set of support vectors
SLIDE 54

SVM: Non-Separable Case

  • Data are most likely not to be linearly separable, but a linear classifier may still be appropriate.
  • Can apply SVM in the non-linearly-separable case.
  • Data should be "almost" linearly separable for good performance.

SLIDE 55

SVM: Non-Separable Case

  • Use slack variables ξ1, …, ξn (one for each sample)
  • Change the constraints:
  • ξi is a measure of deviation from the ideal for xi:
  – ξi ≥ 1: xi is on the wrong side of the separating hyperplane
  – 0 < ξi < 1: xi is on the right side of the separating hyperplane but within the region of maximum margin
  – ξi = 0: the ideal case for xi

SLIDE 56

SVM: Non-Separable Case

  • Unfortunately this minimization problem is NP-hard due to the discontinuity of the indicator functions I(ξi).
  • Instead, we minimize a weighted sum of ||w||²/2 and the total slack Σi ξi,

subject to the relaxed margin constraints and ξi ≥ 0.

SLIDE 57

SVM: Non-Separable Case

  • Use the Karush-Kuhn-Tucker (KKT) conditions to convert to:
  • w and w0 are computed using:
  • Remember that:
SLIDE 58

What about multi-class SVMs?

  • Unfortunately, there is no "definitive" multi-class SVM formulation.
  • In practice, we obtain a multi-class SVM by combining multiple two-class SVMs.
  • One vs. others
  • Training: learn an SVM for each class vs. the others
  • Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
  • One vs. one
  • Training: learn an SVM for each pair of classes
  • Testing: each learned SVM "votes" for a class to assign to the test example

SLIDE 59

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 60

Non-linear SVMs: Feature Spaces

  • Sec. 15.2.3
  • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

SLIDE 61

Kernels

  • SVM optimization: maximize the dual objective.
  • Note this optimization depends on the samples xi only through the dot products xi^T xj.
  • If we lift xi to a high dimension using φ(xi), we need to compute the high-dimensional product φ(xi)^T φ(xj).
  • Idea: find a kernel function K(xi, xj) such that K(xi, xj) = φ(xi)^T φ(xj).
SLIDE 62

Kernel Trick

  • Then we only need to compute K(xi, xj) instead of φ(xi)^T φ(xj).
  • "Kernel trick": we do not need to perform operations in the high-dimensional space explicitly.
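
As a small numerical illustration of the trick (the specific kernel is an example, not one prescribed by the slides), the quadratic kernel K(x, z) = (x^T z)² equals φ(x)^T φ(z) for an explicit monomial map, so the high-dimensional product never has to be formed:

```python
import numpy as np

def poly2_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed in the original (low-dimensional) space."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D:
    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly2_kernel(x, z))   # 1.0, computed without the feature map
print(phi(x) @ phi(z))      # same value, via the high-dimensional dot product
```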

SLIDE 63

Choice of Kernel

  • How to choose the kernel function K(xi, xj)?
  – K(xi, xj) should correspond to φ(xi)^T φ(xj) in some higher-dimensional space
  – Mercer's condition tells us which kernel functions can be expressed as a dot product of two vectors
  – If K and K' are kernels, then aK + bK' is a kernel (for a, b ≥ 0)
  • The mappings φ(xi) never have to be computed!
SLIDE 64

Pattern recognition design cycle

How can we learn the rule from data?

  • Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
  • Unsupervised learning: the system forms clusters or natural groupings of the input patterns.
  • Reinforcement learning: no desired category is given, but the teacher provides feedback to the system, such as whether a decision is right or wrong.

SLIDE 65

Learning Algorithms

  • To design a learning algorithm, we face the following

problems:

Whether to stop?

In what direction to proceed?

How long a step to take?

Is the criterion satisfactory?

SLIDE 66

Criterion Function

  • To facilitate learning, we usually define a scalar

criterion function.

  • It usually represents the penalty or cost of a solution.
  • Our goal is to minimize its value, i.e., function optimization.
SLIDE 67

Learning Using Iterative Optimization

  • Minimize an error function J(α) (e.g., the classification error) with respect to α.
  • Minimize J(α) iteratively:

α(k+1) = α(k) + η(k) p_k

where p_k is the search direction and η(k) is the learning rate.

How should we choose p_k?

SLIDE 68

Choosing pk using Gradient Descent

  • Gradient descent chooses the search direction p_k = −∇J(α(k)), i.e., the direction of steepest descent.

SLIDE 69

Gradient Descent (cont'd)

[Figure: contour plot of the criterion J(α) over the search space, showing the descent path from α(0) to α(k).]

SLIDE 70

Gradient Descent (cont'd)

  • How to choose the learning rate η(k)?
  • Using a second-order Taylor series approximation of J(α) around α(k) (which involves the Hessian matrix of second derivatives) and setting α = α(k+1) yields the optimum learning rate.

SLIDE 71

Choosing pk using Newton's Method

p_k = −H^{-1} ∇J(α(k))

This requires inverting the Hessian H.

SLIDE 72

Newton's Method (cont'd)

[Figure: contour plot of a quadratic criterion J(α); Newton's method jumps from α(0) to the minimum in a single step.]

If J(α) is quadratic, Newton's method converges in one step!

SLIDE 73

Gradient Descent vs. Newton's Method

[Figure: descent paths of Newton's method and gradient descent on the same criterion surface.]
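
A sketch comparing the two update rules on an assumed quadratic criterion J(α) = ½ α^T A α − b^T α: Newton's step p_k = −H^{-1}∇J reaches the minimum in one iteration, while fixed-step gradient descent needs many.

```python
import numpy as np

# Quadratic criterion J(alpha) = 0.5 * alpha^T A alpha - b^T alpha (assumed example).
A = np.array([[3.0, 0.5], [0.5, 1.0]])    # Hessian H (constant for a quadratic)
b = np.array([1.0, -2.0])
grad = lambda a: A @ a - b
alpha_star = np.linalg.solve(A, b)        # true minimizer

# Gradient descent: p_k = -grad(alpha), with a fixed learning rate (assumed).
a_gd = np.zeros(2)
for _ in range(50):
    a_gd = a_gd - 0.1 * grad(a_gd)

# Newton's method: p_k = -H^{-1} grad(alpha) -> one step suffices for a quadratic.
a_newton = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))

print(np.linalg.norm(a_gd - alpha_star))       # small but nonzero after 50 steps
print(np.linalg.norm(a_newton - alpha_star))   # essentially zero after one step
```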

SLIDE 74

Pattern recognition design cycle

  • How can we estimate the performance with training samples?
  • How can we predict the performance with future data?
  • Problems of overfitting and generalization.

SLIDE 75

Receiver Operating Characteristic (ROC) Curve

  • Every classifier typically employs some kind of a threshold.
  • Changing the threshold will affect the performance of the classifier.
  • ROC curves allow us to evaluate the performance of a classifier using different thresholds.
  • The likelihood-ratio thresholds are

θ_a = P(ω2) / P(ω1)

θ_b = P(ω2)(λ12 − λ22) / [ P(ω1)(λ21 − λ11) ]
SLIDE 76

ROC Curve

FPR: False Positive Rate (x-axis); TPR: True Positive Rate (y-axis)

Possible outcomes: correct acceptance, false positive, correct rejection, false negative.
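
A sketch of how ROC points are traced by sweeping the decision threshold over classifier scores; the scores and labels below are synthetic assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives (assumed toy data).
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])
labels = np.concatenate([np.ones(100, dtype=int), np.zeros(100, dtype=int)])

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs for every possible decision threshold."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))   # correct acceptances
        fp = np.sum((pred == 1) & (labels == 0))   # false positives
        fn = np.sum((pred == 0) & (labels == 1))   # false negatives
        tn = np.sum((pred == 0) & (labels == 0))   # correct rejections
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(points)

curve = roc_points(scores, labels)
print(curve[:5])   # a few (FPR, TPR) points along the ROC curve
```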

SLIDE 77

Overfitting

  • Prediction error: probability that a test pattern is not in the class with the maximum (true) posterior.
  • Training error: probability that a test pattern is not in the class with the maximum (estimated) posterior.
  • The classifier is optimized w.r.t. the training error.
  • The training error is thus an optimistically biased estimate of the prediction error.

SLIDE 78

Overfitting

  • Overfitting: a learning algorithm overfits the training data if it outputs a solution w when another solution w' exists such that w has a lower training error but a higher true (expected) error than w'.

SLIDE 79

Example: Fish Classifier

SLIDE 80

Minimum Training Error

SLIDE 81

Final Decision Boundary

SLIDE 82

Typical Behavior

SLIDE 83

Typical Behavior

SLIDE 84

Typical Behavior

  • The aim is to get a classification model that generalizes, i.e., classifies new inputs appropriately.
  • If the training data is known to contain noise, we don't necessarily want the training data to be classified totally accurately, because that is likely to reduce the generalization ability.
