Lecture 12: Midterm Exam Review


SLIDE 1

Lecture 12: Midterm Exam Review

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2

Pattern recognition design cycle

SLIDE 3

Pattern recognition design cycle

  • Collecting training and testing data.
  • How can we know when we have an adequately large and representative set of samples?
SLIDE 4

Training/Test Split

  • Randomly split dataset into two parts:
  • Training data
  • Test data
  • Use training data to optimize parameters
  • Evaluate error using test data
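
As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of such a random split; the synthetic data and the 80/20 ratio are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 100 samples, 2 features, binary labels (assumed for illustration).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Randomly permute the indices, then split 80% / 20% (an assumed ratio).
perm = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]   # used to optimize parameters
X_test, y_test = X[test_idx], y[test_idx]       # used to evaluate the error

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```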
SLIDE 5

Training/Test Split

  • How many points in each set?
  • A very hard question.
  • Too few points in the training set: the learned classifier is bad.
  • Too few points in the test set: the classifier evaluation is insufficient.
  • Cross-validation
  • Leave-one-out cross-validation
SLIDE 6

Cross-Validation

  • In practice
  • Available data => training and validation
  • Train on the training data
  • Test on the validation data
  • k-fold cross validation:
  • Data randomly separated into k groups
  • Each time, k − 1 groups are used for training and one as testing (a minimal index-split sketch follows below)
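
A minimal index-level sketch of this k-fold split, assuming k = 5 and a placeholder `train_and_evaluate` function standing in for whatever classifier is being validated:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Randomly separate sample indices into k groups (folds)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return np.array_split(perm, k)

def cross_validate(X, y, k, train_and_evaluate):
    """Each time, k-1 folds are used for training and one for validation."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
    return np.mean(scores)

# Toy usage with a trivial "classifier" that predicts the majority training label.
X = np.random.default_rng(1).normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)
majority = lambda Xtr, ytr, Xva, yva: np.mean(yva == np.bincount(ytr).argmax())
print(cross_validate(X, y, k=5, train_and_evaluate=majority))
```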
SLIDE 7

Cross Validation and Test Accuracy

  • Using CV on training + validation
  • Classify test data with the best parameters from CV
SLIDE 8

Pattern recognition design cycle

  • Domain dependence and prior information.
  • Computational cost and feasibility.
  • Discriminative features, i.e., similar values for similar patterns, and different values for different patterns.
  • Invariant features with respect to translation, rotation and scale.
  • Robust features with respect to occlusion, distortion, deformation, and variations in environment.
SLIDE 9

PCA: Visualization

Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.

SLIDE 10

Computation of PCA

  • In practice we compute PCA via SVD (singular value decomposition).
  • Form the centered data matrix:

X_{p,N} = [ (x_1 − m), …, (x_N − m) ]

  • Compute its SVD:

X_{p,N} = U_{p,p} D_{p,N} V^T_{N,N}

U and V are orthogonal matrices, D is a diagonal matrix.

SLIDE 11

Computation of PCA…

  • Sometimes we are given only a few high-dimensional data points, i.e., p ≥ N.
  • In such cases compute the SVD of X^T:

X^T_{N,p} = V_{N,N} D_{N,N} U^T_{N,p}

So we get:

X_{p,N} = U_{p,N} D_{N,N} V^T_{N,N}

  • Then, proceed as before: choose only d < N significant eigenvalues for the data representation,

x̃_i = U_{p,d} U^T_{p,d} (x_i − m) + m

  • Usually we use the reduced-dimension features to fit the classification models.
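
A small NumPy sketch of the SVD-based computation above, following the slides' convention that samples are the columns of X; the synthetic data and the choice d = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# N samples of dimension p, stored as columns (p x N), as in the slides.
p, N, d = 5, 20, 2
data = rng.normal(size=(p, N))

m = data.mean(axis=1, keepdims=True)   # mean vector m
X = data - m                           # centered data matrix X_{p,N}

# SVD of the centered data: X = U D V^T (economy form).
U, D, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the d leading eigenvectors (columns of U) as the PCA basis.
Ud = U[:, :d]

# Reduced-dimension features, usually the input to the classifier:
Z = Ud.T @ X                           # d x N

# Reconstruction x~_i = U_d U_d^T (x_i - m) + m
X_rec = Ud @ Z + m
print(np.linalg.norm(data - X_rec))    # reconstruction error with d components
```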

SLIDE 12

Fisher Linear Discriminant

  • We need to normalize by both the scatter of class 1 and the scatter of class 2.
  • The Fisher linear discriminant is the projection onto a line in the direction v which maximizes J(v), the ratio of between-class to within-class scatter.

SLIDE 13

Fisher Linear Discriminant

  • Thus our objective function can be written:
  • Maximize J(v) by taking the derivative w.r.t. v and setting it to 0
SLIDE 14

Fisher Linear Discriminant

SLIDE 15

Fisher Linear Discriminant

  • If S_W has full rank (the inverse exists), we can convert this to a standard eigenvalue problem.
  • But S_B x, for any vector x, points in the same direction as μ1 − μ2.
  • Based on this, we can solve the eigenvalue problem directly.

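For the two-class case this eigenvalue problem has the well-known closed-form direction v ∝ S_W^{-1}(μ1 − μ2); the sketch below computes it with NumPy on synthetic data (the data and class parameters are assumptions).

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction v = S_W^{-1} (mu1 - mu2).

    X1, X2: arrays of shape (n_i, d) holding the samples of class 1 and class 2.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W: sum of the two class scatter matrices.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    return np.linalg.solve(Sw, mu1 - mu2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=[1.0, 3.0], size=(50, 2))
X2 = rng.normal(loc=[3, 3], scale=[1.0, 3.0], size=(50, 2))
v = fisher_direction(X1, X2)
print(v / np.linalg.norm(v))   # projection direction maximizing J(v)
```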

SLIDE 16

Example

SLIDE 17

Pattern recognition design cycle

How can we know how close we are to the true model underlying the patterns?

  • Domain dependence and prior information.
  • Definition of design criteria.
  • Parametric vs. non-parametric models.
  • Handling of missing features.
  • Computational complexity.
  • Types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid.

SLIDE 18

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 19

Decision Rule

  • Using Bayes' rule:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)   (posterior = likelihood × prior / evidence)

where the evidence is

p(x) = Σ_{j=1..2} p(x | ωj) P(ωj)

Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

or

Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2

or

Decide ω1 if p(x | ω1) / p(x | ω2) > P(ω2) / P(ω1); otherwise decide ω2
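
A minimal sketch of this decision rule for two classes with assumed one-dimensional Gaussian class-conditional densities and priors (all numbers below are illustrative assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed 1-D class-conditional densities and priors (for illustration only).
P1, P2 = 0.6, 0.4
mu1, mu2, sigma = 0.0, 2.0, 1.0

def decide(x):
    """Decide omega1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide omega2."""
    g1 = gaussian_pdf(x, mu1, sigma) * P1
    g2 = gaussian_pdf(x, mu2, sigma) * P2
    return 1 if g1 > g2 else 2

for x in [-1.0, 0.5, 1.0, 1.5, 3.0]:
    print(x, "-> omega", decide(x))
```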

SLIDE 20

Discriminant Functions

  • A useful way to represent a classifier is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if gi(x) > gj(x) for all j ≠ i.

SLIDE 21

Discriminants for Bayes Classifier

  • Is the choice of gi unique?
  • Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results. Equivalent choices:

gi(x) = P(ωi | x) = p(x | ωi) P(ωi) / p(x)
gi(x) = p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)

we'll use this discriminant extensively!

SLIDE 22

Case 1: Statistically Independent Features with Identical Variances

SLIDE 23

Case II: Identical Covariances

  • Notes on the decision boundary:
  • As for Case I, it passes through a point x0 lying on the line between the two class means; again, x0 is in the middle if the priors are identical.
  • The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

SLIDE 24

Case III: Arbitrary Covariances

Nonlinear decision boundaries

SLIDE 25

Parameter Estimation

  • Maximum likelihood: the values of the parameters are fixed but unknown.
  • Bayesian estimation / Maximum a posteriori (MAP): the parameters are random variables having some known a priori distribution.

SLIDE 26

Maximum-Likelihood Estimation

  • Use a set D of independent samples to estimate the unknown parameter vector θ.
  • Our goal is to determine the estimate of θ that best agrees with the observed training data.
  • Note that if D is fixed, the likelihood p(D | θ) is a function of θ and is not a density.
SLIDE 27

Example: Gaussian case

  • Assume we have c classes, each with a Gaussian class-conditional density with unknown parameters.
  • Use the information provided by the training samples to estimate the parameters; each parameter vector is associated with one category.
  • Suppose that D contains n samples, x1, …, xn.
SLIDE 28

Maximum-Likelihood Estimation

  • p(D | θ) is called the likelihood of θ w.r.t. the set of samples.
  • The ML estimate of θ is, by definition, the value that maximizes p(D | θ).
  • "It is the value of θ that best agrees with the actually observed training samples."
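
For the Gaussian case of the previous slide, the ML estimates are the sample mean and the (1/n) sample covariance, a standard result; the sketch below checks this on synthetic data (the true parameters are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
D = rng.multivariate_normal(true_mu, true_cov, size=500)   # n samples from one class

# ML estimates for a Gaussian: sample mean and (biased, 1/n) sample covariance.
mu_hat = D.mean(axis=0)
diff = D - mu_hat
cov_hat = diff.T @ diff / len(D)

print(mu_hat)    # close to true_mu
print(cov_hat)   # close to true_cov
```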
SLIDE 29

Optimal Estimation

  • Let θ be the parameter vector and ∇_θ the gradient operator.
  • We define l(θ) = ln p(D | θ) as the log-likelihood function.
  • New problem statement: determine the θ that maximizes the log-likelihood.
SLIDE 30

Optimal Estimation

  • Local or global maximum
  • Local or global minimum
  • Saddle point
  • Boundary of parameter space
SLIDE 31

Bayesian Estimation (MAP): General Theory

  • The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized.
  • The basic assumptions are:
  – The form of p(x | θ) is assumed known, but the value of θ is not known exactly.
  – Our knowledge about θ is assumed to be contained in a known prior density p(θ).
  – The rest of our knowledge is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x).

SLIDE 32

Bayesian Estimation (MAP): General Theory

  • The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)".
  • Using Bayes formula, we have:

p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

  • And by the independence assumption:

p(D | θ) = Π_{k=1..n} p(x_k | θ)
SLIDE 33

MLE vs. MAP

  • Maximum Likelihood estimation (MLE)
  – Choose the parameter value that maximizes the probability of the observed data
  • Maximum a posteriori (MAP) estimation
  – Choose the parameter value that is most probable given the observed data and the prior belief

When is MAP same as MLE?
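
A small sketch contrasting the two estimates for the mean of a 1-D Gaussian with known variance and a Gaussian prior on the mean (the prior parameters and data are assumptions for illustration); note how MAP approaches MLE as the prior flattens or the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                       # known data variance (assumed)
mu0, tau2 = 0.0, 0.25              # Gaussian prior on the mean (assumed)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=10)

# MLE: the value that maximizes the likelihood -> the sample mean.
mu_mle = x.mean()

# MAP: posterior mode for a Gaussian likelihood with a Gaussian prior (closed form).
n = len(x)
mu_map = (n * tau2 * mu_mle + sigma2 * mu0) / (n * tau2 + sigma2)

print(mu_mle, mu_map)   # MAP is pulled toward the prior mean mu0
```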

SLIDE 34

Naïve Bayes Classifier (not BE)

  • A simple classifier that applies Bayes' rule with strong (naive) independence assumptions among the features.
  • A.k.a. the "independent feature model".
  • Often performs reasonably well despite its simplicity.
SLIDE 35

Naïve Bayes Classifier

  • NB is known to produce posteriors closer to the extremes (0 or 1) than the true posteriors – Why?
  • NB performs well when only small amounts of training data are available – Why?

SLIDE 36

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 37

Density Estimation: Two Approaches

  • Parzen windows:

Shrink an initial region by specifying its volume Vn as some function of n, and show that the resulting estimate converges to p(x). This is called "the Parzen-window estimation method".

  • k-Nearest Neighbors:

Specify kn as some function of n; the volume Vn is grown until it encloses kn neighbors of x. This is called "the kn-nearest-neighbor estimation method".

SLIDE 38

The k–Nearest-Neighbor Rule

  • Goal: classify x by assigning it the label most frequently represented among the k nearest samples.
  • Use a voting scheme.

The k-nearest-neighbor query starts at the test point and grows a spherical region until it encloses k training samples, and labels the test point by a majority vote of these samples.
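
A minimal kNN classifier implementing this voting scheme; Euclidean distance, k = 5, and the synthetic data are assumptions for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Label x by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all samples
    nearest = np.argsort(dists)[:k]               # indices of the k nearest samples
    votes = np.bincount(y_train[nearest])         # voting scheme
    return votes.argmax()

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
# Nonlinear (multi-modal-style) class structure that parametric models handle poorly.
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 < 1).astype(int)

print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))   # likely class 1
print(knn_predict(X_train, y_train, np.array([2.0, 2.0])))   # likely class 0
```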

SLIDE 39

kNN: Multi-modal Distributions

  • Most parametric distributions would not work for this 2-class classification problem.
  • Nearest neighbors will do reasonably well, provided we have a lot of samples.

SLIDE 40

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 41

Augmented feature space

  • Augmented feature/parameter space:

g(x) = w^T x + w0 = Σ_{i=1..d} w_i x_i + w0 = α^T y

w = [w1, …, wd]^T  ⇒  α = [w0, w1, …, wd]^T
x = [x1, …, xd]^T  ⇒  y = [1, x1, …, xd]^T

Discriminant:

g(x) = α^T y

If α^T y_i ≥ 0 assign y_i to ω1, else if α^T y_i < 0 assign y_i to ω2

SLIDE 42

Normalization

Seek a hyperplane that separates patterns from different categories; equivalently, after "normalization" (negating the samples of ω2), seek a hyperplane that puts all normalized patterns on the same (positive) side.

Classification rule: If α^T y_i > 0 assign y_i to ω1, else if α^T y_i < 0 assign y_i to ω2

SLIDE 43

Perceptron Batch Rule

  • The perceptron criterion and its gradient are:

J_p(α) = Σ_{y ∈ Y(α)} (−α^T y)

∇J_p(α) = Σ_{y ∈ Y(α)} (−y)

where Y(α) is the set of samples misclassified by α.

  • The perceptron update rule is obtained using gradient descent:

α(k+1) = α(k) + η(k) Σ_{y ∈ Y(α)} y

  • It is not possible to solve ∇J_p(α) = 0 analytically.
  • It is called the batch rule because it is based on all misclassified examples.
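
A sketch of this batch rule operating on normalized augmented samples (the ω2 samples negated, as on the normalization slide); the fixed learning rate and the synthetic data are assumptions for the example.

```python
import numpy as np

def perceptron_batch(Y, eta=0.1, max_iter=1000):
    """Batch perceptron rule on normalized augmented samples Y (one row per sample).

    Each iteration adds eta * (sum of the misclassified samples) to alpha.
    """
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ alpha <= 0]          # samples with alpha^T y <= 0
        if len(misclassified) == 0:
            break                                  # separating hyperplane found
        alpha = alpha + eta * misclassified.sum(axis=0)
    return alpha

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2, 2], size=(20, 2))              # class omega1
X2 = rng.normal(loc=[-2, -2], size=(20, 2))            # class omega2
aug = lambda X: np.hstack([np.ones((len(X), 1)), X])   # augmented vectors y = [1, x]
Y = np.vstack([aug(X1), -aug(X2)])                     # negate the omega2 samples

alpha = perceptron_batch(Y)
print(alpha, np.all(Y @ alpha > 0))                    # True if a separator was found
```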

SLIDE 44

Perceptron Single Sample Rule

  • The gradient descent single-sample rule for J_p(α) is:
  – Note that y_M is one sample misclassified by α(k)
  – Must have a consistent way of visiting the samples
  • Geometric interpretation:
  – y_M is on the wrong side of the decision hyperplane
  – Adding ηy_M to α moves the new decision hyperplane in the right direction with respect to y_M

SLIDE 45

MSE Criterion Function

  • Minimum squared error approach: find an α which minimizes the length of the error vector e.
  • Thus, minimize the minimum squared error criterion function.
  • Unlike the perceptron criterion function, we can optimize the minimum squared error criterion function analytically by setting its gradient to 0.

SLIDE 46

Computing the Gradient

SLIDE 47

Pseudo-Inverse Solution

  • Setting the gradient to 0:
  • The matrix Y^T Y is square (it has d + 1 rows and columns) and it is often non-singular.
  • If Y^T Y is non-singular, its inverse exists and we can solve for α uniquely: α = (Y^T Y)^{-1} Y^T b.
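
A sketch of this pseudo-inverse solution; taking b as a vector of ones is an assumed (common) choice, and np.linalg.pinv carries out the (Y^T Y)^{-1} Y^T computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2, 2], size=(20, 2))
X2 = rng.normal(loc=[-2, -2], size=(20, 2))
aug = lambda X: np.hstack([np.ones((len(X), 1)), X])
Y = np.vstack([aug(X1), -aug(X2)])    # normalized augmented samples (omega2 negated)

b = np.ones(len(Y))                   # margin vector of all ones (an assumed choice)

# MSE solution a = (Y^T Y)^{-1} Y^T b, computed via the pseudo-inverse of Y.
a = np.linalg.pinv(Y) @ b

print(a, np.mean(Y @ a > 0))          # fraction of samples on the positive side
```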

SLIDE 48

Ho-Kashyap Procedure

  • As usual, take partial derivatives w.r.t. a and b
  • Use modified gradient descent procedure to find a

minimum of JHK(a,b)

  • Alternate the two steps below until convergence:

Fix b and minimize JHK(a,b) with respect to a

Fix a and minimize JHK(a,b) with respect to b

SLIDE 49

LDF Summary

  • Perceptron procedures
  – Find a separating hyperplane in the linearly separable case
  – Do not converge in the non-separable case
  – Can force convergence by using a decreasing learning rate, but are not guaranteed a reasonable stopping point
  • MSE procedures
  – Converge in the separable and non-separable cases
  – May not find a separating hyperplane even if the classes are linearly separable
  – Use the pseudoinverse if Y^T Y is not singular and not too large
  – Use gradient descent (Widrow-Hoff procedure) otherwise
  • Ho-Kashyap procedures
  – Always converge
  – Find a separating hyperplane in the linearly separable case
  – More costly

SLIDE 50

Linear Support Vector Machine (SVM)

  • Sec. 15.1
  • Separating hyperplane: w^T x + b = 0, with margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = −1
  • Maximize the margin

ρ = ||x_a − x_b|| = 2/||w||

  • Primal problem
SLIDE 51

SVM solution: Lagrange multipliers

  • We can then swap ’max’ and ’min’:
  • We can find the optimal w as a function of {αi} by

setting the derivatives to zero:

SLIDE 52

SVM: Optimal Hyperplane

  • Use the Karush-Kuhn-Tucker (KKT) conditions to convert our problem to:
  • α = {α1, …, αn} are new variables, one for each sample
  • Optimized by quadratic programming
SLIDE 53

SVM: Optimal Hyperplane

  • After finding the optimal α = {α1, …, αn}
  • Final discriminant function:
  • where S is the set of support vectors
SLIDE 54

SVM: Non-Separable Case

  • Data are most likely not to be linearly separable, but a linear classifier may still be appropriate.
  • Can apply SVM in the non-linearly-separable case.
  • Data should be "almost" linearly separable for good performance.

SLIDE 55

SVM: Non-Separable Case

  • Use slack variables ξ1, …, ξn (one for each sample)
  • Change the constraints:
  • ξi is a measure of deviation from the ideal for xi:
  – ξi ≥ 1: xi is on the wrong side of the separating hyperplane
  – 0 < ξi < 1: xi is on the right side of the separating hyperplane but within the region of maximum margin
  – ξi = 0: the ideal case for xi

SLIDE 56

SVM: Non-Separable Case

  • Unfortunately this minimization problem is NP-hard due to the discontinuity of the indicator functions I(ξi).
  • Instead, we minimize a weighted sum of ||w||²/2 and the total slack Σi ξi,

subject to the relaxed margin constraints and ξi ≥ 0.

SLIDE 57

SVM: Non-Separable Case

  • Use the Karush-Kuhn-Tucker (KKT) conditions to convert to:
  • w and w0 are computed using:
  • Remember that:
SLIDE 58

What about multi-class SVMs?

  • Unfortunately, there is no "definitive" multi-class SVM formulation.
  • In practice, we obtain a multi-class SVM by combining multiple two-class SVMs.
  • One vs. others
  • Training: learn an SVM for each class vs. the others
  • Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
  • One vs. one
  • Training: learn an SVM for each pair of classes
  • Testing: each learned SVM "votes" for a class to assign to the test example

SLIDE 59

The Classifiers We Have Learned So Far

  • Bayesian classifiers: MLE classifier, MAP classifier, Naive Bayes classifier
  • Nonparametric classifiers: KNN classifier
  • Linear classifiers: LDF (Perceptron rule & Minimum Squared Error rule & Ho-Kashyap procedure), SVM classifier
  • Nonlinear classifiers: linear classifiers + kernel tricks / feature mapping Φ

SLIDE 60

Non-linear SVMs: Feature Spaces

  • Sec. 15.2.3
  • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

SLIDE 61

Kernels

  • SVM optimization: maximize the dual objective.
  • Note this optimization depends on the samples xi only through the dot products xi^T xj.
  • If we lift xi to a high dimension using φ(xi), we need to compute the high-dimensional product φ(xi)^T φ(xj).
  • Idea: find a kernel function K(xi, xj) such that K(xi, xj) = φ(xi)^T φ(xj).
SLIDE 62

Kernel Trick

  • Then we only need to compute K(xi, xj) instead of φ(xi)^T φ(xj).
  • "Kernel trick": we do not need to perform operations in the high-dimensional space explicitly.
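
As a small numerical illustration of the trick (the specific kernel is an example, not one prescribed by the slides), the quadratic kernel K(x, z) = (x^T z)² equals φ(x)^T φ(z) for an explicit monomial map, so the high-dimensional product never has to be formed:

```python
import numpy as np

def poly2_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed in the original (low-dimensional) space."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D:
    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly2_kernel(x, z))   # 1.0, computed without the feature map
print(phi(x) @ phi(z))      # same value, via the high-dimensional dot product
```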

SLIDE 63

Choice of Kernel

  • How to choose the kernel function K(xi, xj)?
  – K(xi, xj) should correspond to φ(xi)^T φ(xj) in some higher-dimensional space
  – Mercer's condition tells us which kernel functions can be expressed as a dot product of two vectors
  – If K and K' are kernels, then aK + bK' is a kernel (for a, b ≥ 0)
  • The mappings φ(xi) never have to be computed!
SLIDE 64

Pattern recognition design cycle

How can we learn the rule from data?

  • Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
  • Unsupervised learning: the system forms clusters or natural groupings of the input patterns.
  • Reinforcement learning: no desired category is given, but the teacher provides feedback to the system, such as whether a decision is right or wrong.

SLIDE 65

Learning Algorithms

  • To design a learning algorithm, we face the following

problems:

Whether to stop?

In what direction to proceed?

How long a step to take?

Is the criterion satisfactory?

SLIDE 66

Criterion Function

  • To facilitate learning, we usually define a scalar

criterion function.

  • It usually represents the penalty or cost of a solution.
  • Our goal is to minimize its value, i.e., function optimization.
SLIDE 67

Learning Using Iterative Optimization

  • Minimize an error function J(α) (e.g., the classification error) with respect to α.
  • Minimize J(α) iteratively:

α(k+1) = α(k) + η(k) p_k

where p_k is the search direction and η(k) is the learning rate.

How should we choose p_k?

SLIDE 68

Choosing pk using Gradient Descent

  • Gradient descent chooses the search direction p_k = −∇J(α(k)), i.e., the direction of steepest descent.

SLIDE 69

Gradient Descent (cont'd)

[Figure: contour plot of the criterion J(α) over the search space, showing the descent path from α(0) to α(k).]

SLIDE 70

Gradient Descent (cont'd)

  • How to choose the learning rate η(k)?
  • Using a second-order Taylor series approximation of J(α) around α(k) (which involves the Hessian matrix of second derivatives) and setting α = α(k+1) yields the optimum learning rate.

SLIDE 71

Choosing pk using Newton's Method

p_k = −H^{-1} ∇J(α(k))

This requires inverting the Hessian H.

SLIDE 72

Newton's Method (cont'd)

[Figure: contour plot of a quadratic criterion J(α); Newton's method jumps from α(0) to the minimum in a single step.]

If J(α) is quadratic, Newton's method converges in one step!

SLIDE 73

Gradient Descent vs. Newton's Method

[Figure: descent paths of Newton's method and gradient descent on the same criterion surface.]
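
A sketch comparing the two update rules on an assumed quadratic criterion J(α) = ½ α^T A α − b^T α: Newton's step p_k = −H^{-1}∇J reaches the minimum in one iteration, while fixed-step gradient descent needs many.

```python
import numpy as np

# Quadratic criterion J(alpha) = 0.5 * alpha^T A alpha - b^T alpha (assumed example).
A = np.array([[3.0, 0.5], [0.5, 1.0]])    # Hessian H (constant for a quadratic)
b = np.array([1.0, -2.0])
grad = lambda a: A @ a - b
alpha_star = np.linalg.solve(A, b)        # true minimizer

# Gradient descent: p_k = -grad(alpha), with a fixed learning rate (assumed).
a_gd = np.zeros(2)
for _ in range(50):
    a_gd = a_gd - 0.1 * grad(a_gd)

# Newton's method: p_k = -H^{-1} grad(alpha) -> one step suffices for a quadratic.
a_newton = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))

print(np.linalg.norm(a_gd - alpha_star))       # small but nonzero after 50 steps
print(np.linalg.norm(a_newton - alpha_star))   # essentially zero after one step
```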

SLIDE 74

Pattern recognition design cycle

  • How can we estimate the performance with training samples?
  • How can we predict the performance with future data?
  • Problems of overfitting and generalization.

SLIDE 75

Receiver Operating Characteristic (ROC) Curve

  • Every classifier typically employs some kind of a threshold.
  • Changing the threshold will affect the performance of the classifier.
  • ROC curves allow us to evaluate the performance of a classifier using different thresholds.
  • The likelihood-ratio thresholds are

θ_a = P(ω2) / P(ω1)

θ_b = P(ω2)(λ12 − λ22) / [ P(ω1)(λ21 − λ11) ]
SLIDE 76

ROC Curve

FPR: False Positive Rate (x-axis); TPR: True Positive Rate (y-axis)

Possible outcomes: correct acceptance, false positive, correct rejection, false negative.
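
A sketch of how ROC points are traced by sweeping the decision threshold over classifier scores; the scores and labels below are synthetic assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives (assumed toy data).
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])
labels = np.concatenate([np.ones(100, dtype=int), np.zeros(100, dtype=int)])

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs for every possible decision threshold."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))   # correct acceptances
        fp = np.sum((pred == 1) & (labels == 0))   # false positives
        fn = np.sum((pred == 0) & (labels == 1))   # false negatives
        tn = np.sum((pred == 0) & (labels == 0))   # correct rejections
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(points)

curve = roc_points(scores, labels)
print(curve[:5])   # a few (FPR, TPR) points along the ROC curve
```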

SLIDE 77

Overfitting

  • Prediction error: probability that a test pattern is not in the class with the maximum (true) posterior.
  • Training error: probability that a test pattern is not in the class with the maximum (estimated) posterior.
  • The classifier is optimized w.r.t. the training error.
  • The training error is thus an optimistically biased estimate of the prediction error.

SLIDE 78

Overfitting

  • Overfitting: a learning algorithm overfits the training data if it outputs a solution w when another solution w' exists such that w has a lower training error but a higher true (expected) error than w'.

SLIDE 79

Example: Fish Classifier

SLIDE 80

Minimum Training Error

SLIDE 81

Final Decision Boundary

SLIDE 82

Typical Behavior

SLIDE 83

Typical Behavior

SLIDE 84

Typical Behavior

  • The aim is to get a classification model that generalizes, i.e., classifies new inputs appropriately.
  • If the training data is known to contain noise, we don't necessarily want the training data to be classified totally accurately, because that is likely to reduce the generalization ability.
