

SLIDE 1

Applied Machine Learning

Some important concepts

Siamak Ravanbakhsh

COMP 551 (fall 2020)

SLIDE 2

Admin

Weekly quiz:

  • practice quiz was released yesterday
  • 24 hrs to submit your answers
  • correct answers are released afterward
  • no extension possible
  • your lowest score across all quizzes is ignored

Mini-project 1:

we are still working on it... instead, a mini-project from last year has been released to give you an idea

Math tutorial: this Friday at noon

SLIDE 3
  • verfitting & generalization

validation and cross-validation curse of dimensionality no free lunch inductive bias of a learning algorithm

Learning objectives

understanding the following concepts

SLIDE 4

Model selection

many ML algorithms have hyper-parameters

(e.g., K in K-nearest neighbors, max-depth of decision tree, etc)

how should we select the best hyper-parameter?

example: performance of KNN regression on the California Housing Dataset (figure)

  • overfitting to the training data: bad performance on unseen data
  • underfitting: the model can more closely fit the training data and still get good test error
  • the best model lies in between

SLIDE 5

Model selection

what if unseen data is completely different from the training data? no point in learning!

assumption: unseen data comes from the same distribution; training data points are samples from an unknown distribution

$x^{(n)}, y^{(n)} \sim p(x, y)$, independent and identically distributed (IID)

(figure: training and unseen data are both drawn from the same distribution $p$)

SLIDE 6

Loss, cost and generalization

assume we have a model $f : x \mapsto y$, for example $f : \mathbb{R}^3 \to \mathbb{R}$

and we have a loss function that measures the error in our prediction, $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

for example, the squared error $\ell(y, \hat{y}) = (y - \hat{y})^2$ or the misclassification loss $\ell(y, \hat{y}) = \mathbb{I}(y \neq \hat{y})$

we train our models to minimize the cost function:

$$J = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(x, y) \in \mathcal{D}_{\text{train}}} \ell(y, f(x))$$

what we really care about is the generalization error:

$$\mathbb{E}_{x, y \sim p}\left[\ell(y, f(x))\right]$$

how to estimate this? we can set aside part of the training data and use it to estimate the generalization error
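As a concrete illustration (not from the slides), here is a minimal Python sketch of the cost J on the training set and its held-out estimate of the generalization error; the model f and the toy data distribution are made up for the example.

```python
import numpy as np

def squared_error(y, y_hat):
    # l(y, y_hat) = (y - y_hat)^2
    return (y - y_hat) ** 2

def cost(f, X, y, loss=squared_error):
    # average loss of model f over a dataset: the cost J
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# toy samples from an (unknown to the learner) distribution p(x, y)
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=1000)

# set aside part of the data to estimate the generalization error
X_train, y_train = X[:800], y[:800]
X_held, y_held = X[800:], y[800:]

f = lambda x: np.sin(x[0])  # a fixed (hypothetical) model, just for illustration
print("training cost J:", cost(f, X_train, y_train))
print("held-out estimate of generalization error:", cost(f, X_held, y_held))
```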

SLIDE 7

Validation set

what we really care about is the generalization error:

$$\mathbb{E}_{x, y \sim p}\left[\ell(y, f(x))\right]$$

how to estimate this? we can set aside part of the training data and use it to estimate the generalization error

(figure: data split into training, validation, and unseen (test) sets)

  • pick the hyper-parameter that gives us the best validation error
  • at the very end, we report the error on the test set
  • validation and test error could be different because they use a limited amount of data
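A minimal sketch of this recipe, assuming scikit-learn is available; it uses KNeighborsRegressor on the California Housing dataset mentioned earlier (fetch_california_housing downloads the data on first use), and the candidate values of K are arbitrary.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)

# split into training, validation, and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# pick the hyper-parameter K with the lowest validation error
best_k, best_err = None, np.inf
for k in [1, 3, 5, 10, 20, 50, 100]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_k, best_err = k, err

# only at the very end do we touch the test set
final = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
print("best K:", best_k, "test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```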

SLIDE 8

Cross validation

how do we get a better estimate of the generalization error? increase the size of the validation set? this reduces the training set

cross-validation helps us get better estimates plus an uncertainty measure:

  • divide the (training + validation) data into L parts (e.g., L = 5)
  • use one part for validation and the other L-1 parts for training

(figure: the data split into training, validation, and test portions)

SLIDE 9

Cross validation

  • divide the (training + validation) data into L parts
  • use one part for validation and the other L-1 parts for training
  • use the average validation error and its variance (uncertainty) to pick the best model

(figure: L = 5 runs; in each run a different fold is used for validation and the rest for training, with the test set held out)

this is called L-fold cross-validation; in leave-one-out cross-validation L = N (only one instance is used for validation)

report the test error for the final model
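A minimal sketch of L-fold cross-validation for K-NN, assuming scikit-learn; the synthetic data is only there to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def l_fold_cv(X, y, k, L=5, seed=0):
    # average validation error and its std across L folds for K-NN with n_neighbors=k
    errors = []
    for train_idx, val_idx in KFold(n_splits=L, shuffle=True, random_state=seed).split(X):
        model = KNeighborsRegressor(n_neighbors=k).fit(X[train_idx], y[train_idx])
        errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    return np.mean(errors), np.std(errors)

# synthetic data, just to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(500, 2))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)

for k in [1, 5, 20]:
    mean_err, std_err = l_fold_cv(X, y, k)
    print(f"K={k:3d}: validation MSE {mean_err:.3f} ± {std_err:.3f}")
```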

SLIDE 10


Cross validation

example: the plot of the mean and standard deviation of the validation error in 10-fold cross-validation (figure)

test error is plotted only to show its agreement with the validation error; in practice we don't look at the test set for hyper-parameter tuning

a rule of thumb: pick the simplest model within one standard deviation of the model with the lowest validation error
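A sketch of this rule of thumb using scikit-learn's cross_val_score; treating a larger K in K-NN as the "simpler" (smoother) model is an assumption made for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(500, 2))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)

ks = [1, 2, 5, 10, 20, 50, 100]
means, stds = [], []
for k in ks:
    # negated MSE -> flip the sign to get an error we want to minimize
    scores = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                              scoring="neg_mean_squared_error", cv=10)
    means.append(scores.mean())
    stds.append(scores.std())

best = int(np.argmin(means))          # model with the lowest mean validation error
threshold = means[best] + stds[best]  # one standard deviation above it
# among all K within the threshold, pick the "simplest" (here: the largest K, i.e. smoothest fit)
simplest = max(k for k, m in zip(ks, means) if m <= threshold)
print("lowest-error K:", ks[best], "one-std-rule K:", simplest)
```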

SLIDE 11

Decision tree

example: a decision tree for the Iris dataset (D=2)

(figure: the dataset, the fitted decision tree, and its decision boundaries)

the decision boundaries suggest overfitting, confirmed using a validation set:

training accuracy ~ 85%, validation accuracy ~ 70%
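A quick way to reproduce this kind of gap, assuming scikit-learn; the exact accuracies depend on the split and the tree settings, so they won't match the slide's numbers exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]   # keep two features (D=2): petal length and width

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# an unconstrained tree can fit the training data (almost) perfectly ...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:  ", tree.score(X_train, y_train))
# ... but the gap to the validation accuracy reveals overfitting
print("validation accuracy:", tree.score(X_val, y_val))
```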

SLIDE 12

Decision tree: overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

example: a decision tree representation of a Boolean function (D=3); there are $2^{2^D}$ such functions, why?

image credit: https://www.wikiwand.com/en/Binary_decision_diagram

this means a decision tree can perfectly fit our training data

How to solve the problem of overfitting in large decision trees?

idea 1. grow a small tree

example: the cost drops only after the second node (figure)

problem: a substantial reduction in cost may happen only after a few more steps; by stopping early we cannot know this

SLIDE 13


Decision tree: overfitting & pruning

idea 2. grow a large tree and then prune it:

  • greedily turn an internal node into a leaf node; the choice is based on the lowest increase in the cost
  • repeat this until left with the root node
  • pick the best among the above models using a validation set (cross-validation is used to pick the best size)

(figure: example decision boundaries before and after pruning)

idea 3. random forests (later!)

SLIDE 14

Evaluation metrics

when evaluating a classifier it is useful to look at the confusion matrix

it is a C×C table that shows how many samples of each class are classified as belonging to each other class (figure: sample images from the CIFAR-10 dataset and the corresponding confusion matrix)

the classifier's accuracy is the sum of the diagonal divided by the sum-total of the matrix
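A small NumPy sketch of the idea, with made-up labels and predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # C x C table: entry [i, j] counts samples of class i classified as class j
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = np.trace(cm) / cm.sum()   # sum of the diagonal over the sum-total
print(cm)
print("accuracy:", accuracy)
```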

SLIDE 15

Evaluation metrics

some other evaluation metrics are based on the confusion table for binary classification, whose elements are TP, TN, FP, FN (type I error = FP, type II error = FN):

  • Accuracy = (TP + TN) / (P + N)
  • Error rate = (FP + FN) / (P + N)
  • Recall = TP / P
  • Precision = TP / (TP + FP)
  • F1 score = 2 · (Precision × Recall) / (Precision + Recall)
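These formulas translate directly to code; the confusion-matrix counts below are made up for illustration:

```python
def binary_metrics(tp, tn, fp, fn):
    # evaluation metrics from the entries of a binary confusion matrix
    p, n = tp + fn, tn + fp              # actual positives and negatives
    accuracy = (tp + tn) / (p + n)
    error = (fp + fn) / (p + n)
    recall = tp / p                      # a.k.a. sensitivity, TPR
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "error rate": error,
            "recall": recall, "precision": precision, "F1": f1}

print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```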

SLIDE 16


Evaluation metrics

if an ML algorithm produces a class score or probability $p(y = 1 \mid x)$, we can trade off between type I & type II errors by varying the decision threshold

goal: evaluate class scores/probabilities independently of the choice of threshold

  • TPR = TP / P (recall, sensitivity)
  • FPR = FP / N (fallout, false alarm)

the Receiver Operating Characteristic (ROC) curve plots TPR against FPR as the threshold varies

the Area Under the Curve (AUC) is sometimes used as a threshold-independent measure of the quality of the classifier
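A minimal NumPy sketch of sweeping the threshold to trace the ROC curve and computing its area; the synthetic scores are made up so the example runs standalone (scikit-learn's roc_curve / roc_auc_score do the same job in practice).

```python
import numpy as np

def roc_points(scores, labels):
    # TPR and FPR obtained by sweeping the decision threshold from high to low
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels == 1)      # true positives as the threshold is lowered
    fp = np.cumsum(labels == 0)      # false positives as the threshold is lowered
    tpr = tp / (labels == 1).sum()   # TPR = TP / P
    fpr = fp / (labels == 0).sum()   # FPR = FP / N
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.normal(scale=0.8, size=200)   # noisy scores correlated with the label

fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal area under the curve
print("AUC:", round(auc, 3))
```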

SLIDE 17

Curse of dimensionality

learning in high dimensions can be difficult: suppose our data is uniformly distributed in some range, say $x \in [0, 3]^D$

predict the label by counting labels in the same unit of the grid (similar to KNN)

to have at least one example per unit, we need $3^D$ training examples

for D=180 we need more training examples than the number of particles in the universe
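A quick sanity check of the arithmetic (10^80 is the usual order-of-magnitude estimate for the number of particles in the observable universe):

```python
grid_cells = 3 ** 180                # one training example per grid cell
particles_in_universe = 10 ** 80     # common order-of-magnitude estimate
print(f"3^180 is roughly 1e{len(str(grid_cells)) - 1}")   # about 1e85
print(grid_cells > particles_in_universe)                 # True
```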

SLIDE 18

Curse of dimensionality

in high dimensions most points have similar distances!

(figure: histogram of pairwise distances of 1000 random points; as we increase the dimension, distances become "similar")
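A small simulation of this effect, assuming NumPy and SciPy; the relative spread of pairwise distances (std divided by mean) shrinks as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(0, 1, size=(1000, d))   # 1000 random points in [0, 1]^d
    dist = pdist(X)                         # all pairwise Euclidean distances
    print(f"D={d:4d}  mean={dist.mean():6.2f}  std/mean={dist.std() / dist.mean():.3f}")
```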

SLIDE 19

Curse of dimensionality

  • Q. why are most distances similar?
  • A. in high dimensions most of the volume is close to the corners!

the volume of a ball of radius r inside a cube of side 2r becomes negligible as D grows:

$$\lim_{D \to \infty} \frac{\mathrm{volume}(\text{ball})}{\mathrm{volume}(\text{cube})} = \lim_{D \to \infty} \frac{2 r^D \pi^{D/2} / \left(D\,\Gamma(D/2)\right)}{(2r)^D} = 0$$

(figure: a ball inscribed in a cube for D = 3, and a "conceptual" visualization of the same idea)

the number of corners and the mass in the corners grow quickly with D

image: Zaki's book on Data Mining and Analysis
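The limit can be checked numerically; a sketch using the log-gamma function to avoid overflow (r = 1 is an arbitrary choice, the ratio does not depend on it):

```python
import numpy as np
from scipy.special import gammaln

def ball_to_cube_ratio(d, r=1.0):
    # volume(D-ball of radius r) / volume(cube of side 2r), computed in log-space
    log_ball = np.log(2) + d * np.log(r) + (d / 2) * np.log(np.pi) - np.log(d) - gammaln(d / 2)
    log_cube = d * np.log(2 * r)
    return np.exp(log_ball - log_cube)

for d in [1, 2, 3, 10, 50, 100]:
    print(f"D={d:3d}  ratio={ball_to_cube_ratio(d):.3e}")   # tends to 0 as D grows
```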

SLIDE 20

Real-world vs. randomly generated data

how come ML methods work for image data (D = number of pixels)?

(figure: histogram of pairwise distances for random data vs. for the D pixels of MNIST digits)

the statistics do not match those of random high-dimensional data!

in fact KNN works well for image classification
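The same comparison can be sketched with scikit-learn's small 8x8 digits dataset standing in for MNIST (an assumption; the slide uses MNIST itself): real images show a much wider relative spread of pairwise distances than random data of the same dimension.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_digits

X_real = load_digits().data / 16.0              # 8x8 digit images, D = 64, scaled to [0, 1]
rng = np.random.default_rng(0)
X_rand = rng.uniform(0, 1, size=X_real.shape)   # random data with the same shape

for name, X in [("digits", X_real), ("random", X_rand)]:
    dist = pdist(X)
    print(f"{name}:  mean={dist.mean():.2f}  std/mean={dist.std() / dist.mean():.3f}")
```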

SLIDE 21


Manifold hypothesis

real-world data is often far from uniformly random

manifold hypothesis: real data lies close to the surface of a manifold

example: data dimension D = 3, manifold dimension $\hat{D} = 2$ (figure)

example: data dimension D = number of pixels, manifold dimension $\hat{D} = 2$ (figure)

SLIDE 22

No free lunch

consider the binary classification task: $\hat{f} : \{0, 1\}^3 \to \{0, 1\}$

suppose this is our dataset (figure)

our learning algorithm can produce one of these functions as our classifier

there are $2^4 = 16$ binary functions that perfectly fit our dataset (why?)

no free lunch: the same algorithm cannot perform well for all possible classes of problems (f); each ML algorithm is biased to perform well on some class of problems
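A tiny enumeration illustrating the count, with a made-up dataset that fixes the label on 4 of the 8 possible inputs:

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))   # all 2^3 = 8 possible inputs

# a hypothetical dataset that fixes the label on 4 of the 8 inputs
dataset = {(0, 0, 0): 0, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 0}

# enumerate all 2^(2^3) = 256 Boolean functions as truth tables over `inputs`
consistent = 0
for table in product([0, 1], repeat=len(inputs)):
    f = dict(zip(inputs, table))
    if all(f[x] == y for x, y in dataset.items()):
        consistent += 1

print("functions consistent with the dataset:", consistent)   # 2^4 = 16
```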

SLIDE 23


Inductive bias

why does it make sense for learning algorithms to be biased?

the world is not random; there are regularities, and induction is possible (why do you think the sun will rise in the east tomorrow morning?)

learning algorithms make implicit assumptions: their learning or inductive bias

e.g., we are often biased towards the simplest explanations of our data

Occam's razor: between two models (explanations) we should prefer the simpler one

example: both of the following models perfectly fit the data: $\hat{f}(x) = x_2$ and $\hat{f}(x) = x_1 \wedge x_2$; the first one is simpler

what are some of the inductive biases in using K-NN?

SLIDE 24

Summary

  • what we care about is the generalization of ML algorithms
  • overfitting: good performance on the training set doesn't mean the same for the test set
  • underfitting: we don't even have good performance on the training set
  • generalization is estimated using a validation set, or better, using cross-validation
  • curse of dimensionality: exponentially more data is needed in higher dimensions; the manifold hypothesis to the rescue!
  • no algorithm can perform well on all problems, or "there ain't no such thing as a free lunch"
  • learning algorithms make assumptions about the data (inductive biases); the strength and correctness of those assumptions affects their performance