

SLIDE 1

Linear Regression with Polynomial Features,

Cross Validation, and Hyperparameter Selection


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to:

  • Prof. Mike Hughes
  • Erik Sudderth (UCI)
  • Finale Doshi-Velez (Harvard)
  • James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Objectives for Today (day 04)

  • Regression with transformations of features
      • Especially, polynomial features
  • Ways to estimate generalization error
      • Fixed Validation Set
      • K-fold Cross Validation
  • Hyperparameter Selection


Mike Hughes - Tufts COMP 135 - Fall 2020

SLIDE 3

What will we learn?


[Figure: the three paradigms: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised learning starts from data-label pairs $\{x_n, y_n\}_{n=1}^N$, a task, and a performance measure, and cycles through training, prediction, and evaluation.]

SLIDE 4


Task: Regression

[Figure: regression highlighted as a supervised learning task. The label $y$ is a numeric variable, e.g. sales in dollars.]

SLIDE 5


Review: Linear Regression

Optimization problem: "Least Squares"

$$\min_{w,b} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$$

An exact formula for the optimal values of $w, b$ exists! The math works in 1D and for many dimensions:

$$[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$

$$\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$$
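The closed-form solution above can be sketched in NumPy. This is an illustrative sketch, not course code: the function name and tiny dataset are assumptions, and `lstsq` is used in place of the explicit matrix inverse for numerical stability.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: append a constant column so the
    bias b is the last entry, matching [w_1 ... w_F b]^T."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # N x (F+1)
    # Equivalent to (X~^T X~)^{-1} X~^T y; lstsq is more stable numerically
    params, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    return params[:-1], params[-1]                        # weights w, bias b

# Tiny check on data generated from the line y = 2x + 1
X = np.arange(5, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
w, b = fit_least_squares(X, y)
```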

SLIDE 6


Transformations of Features

SLIDE 7

Fitting a line isn’t always ideal


SLIDE 8

Can fit linear functions to nonlinear features


$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

A nonlinear function of $x$ that can be written as a linear function of $\phi(x_i)$.

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data:

$$\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$$

$$\hat{y}(x_i) = \sum_{g=0}^{3} \theta_g \phi_g(x_i) = \theta^T \phi(x_i)$$

SLIDE 9

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


$$\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]$$

$$\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]$$

SLIDE 10


Linear Regression with Transformed Features

Optimization problem: "Least Squares", with an exact solution:

$$\hat{y}(x_i) = \theta^T \phi(x_i), \qquad \phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \ldots \;\; \phi_{G-1}(x_i)]$$

$$\min_{\theta} \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2$$

$$\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$$

$$\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \ldots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \ldots & \phi_{G-1}(x_2) \\ \vdots & & \ddots & \vdots \\ 1 & \phi_1(x_N) & \ldots & \phi_{G-1}(x_N) \end{bmatrix} \quad (N \times G \text{ matrix})$$
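As a sketch of this recipe (assuming NumPy; helper names and the test function are illustrative), build the $\Phi$ matrix for polynomial features and solve for $\theta^*$: a function that is nonlinear in $x$ is fit exactly because it is linear in $\theta$.

```python
import numpy as np

def poly_features(x, degree):
    """Phi: N x (degree+1) matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_transformed(Phi, y):
    """theta* = (Phi^T Phi)^{-1} Phi^T y, computed via lstsq for stability."""
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

# A function that is nonlinear in x but linear in the parameters theta
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**3
theta = fit_transformed(poly_features(x, 3), y)
yhat = poly_features(x, 3) @ theta
```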

SLIDE 11

0th degree polynomial features

Credit: Slides from course by Prof. Erik Sudderth (UCI)

[Figure: true function (dashed), training data (points), and predictions from LR using polynomial features (solid).]

SLIDE 12

Credit: Slides from course by Prof. Erik Sudderth (UCI)

1st degree polynomial features

[Figure: true function (dashed), training data (points), and predictions from LR using polynomial features (solid).]

SLIDE 13

Credit: Slides from course by Prof. Erik Sudderth (UCI)

3rd degree polynomial features

[Figure: true function (dashed), training data (points), and predictions from LR using polynomial features (solid).]

SLIDE 14

9th degree polynomial features

Credit: Slides from course by Prof. Erik Sudderth (UCI)

[Figure: true function (dashed), training data (points), and predictions from LR using polynomial features (solid).]

SLIDE 15

Error vs Degree


[Figure: mean squared error vs. polynomial degree.]

SLIDE 16

Error vs Model Complexity


[Figure: error vs. model complexity, from underfitting (0-degree polynomial) to overfitting (high-degree polynomial).]

SLIDE 17

What to do about underfitting?

Increase model complexity (add more features!)


What to do about overfitting?

  • Select among several complexity levels the one that generalizes best (today)
  • Control complexity with a penalty in the training objective (next class)
SLIDE 18


Hyperparameter Selection

Selection problem: What polynomial degree to use?

"Parameter" (e.g. weight values in linear regression): a numerical variable controlling quality of fit that we can effectively estimate by minimizing error on the training set.

"Hyperparameter" (e.g. degree of polynomial features): a numerical variable controlling model complexity / quality of fit whose value we cannot effectively estimate from the training set.

[Figure: mean squared error vs. polynomial degree, on training and test sets.]

If we picked the lowest training error, we'd select a 9-degree polynomial. If we picked the lowest test error, we'd select a 3- or 4-degree polynomial.

SLIDE 19


The goal of regression (and supervised ML generally) is to generalize from a sample to the population.

For any regression task, we might want to:

  • Train a model (estimate parameters)
      • Requires calling `fit` on a training labeled dataset
  • Select hyperparameters (e.g. which degree of polynomial?)
      • Requires evaluating predictions on a validation labeled dataset
  • Report its ability on data it has never seen before ("generalization error" or "test error")
      • Requires comparing predictions to a test labeled dataset

Should ALWAYS use different labeled datasets to do each of these things!
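A minimal sketch of carving one labeled dataset into the three roles (assuming NumPy; the function name and the 60/20/20 fractions are illustrative choices, not from the slides):

```python
import numpy as np

def three_way_split(X, y, frac_valid=0.2, frac_test=0.2, seed=0):
    """Shuffle rows, then carve off test and validation portions.
    The 60/20/20 default split is an illustrative choice, not a rule."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    n_test = int(frac_test * len(y))
    n_valid = int(frac_valid * len(y))
    X_test, y_test = X[:n_test], y[:n_test]
    X_valid, y_valid = X[n_test:n_test + n_valid], y[n_test:n_test + n_valid]
    X_train, y_train = X[n_test + n_valid:], y[n_test + n_valid:]
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)
train, valid, test = three_way_split(X, y)
```

Shuffling first matches the assumption that rows are arranged uniformly at random before splitting.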

SLIDE 20

Two Ways to Measure Generalization Error

  • Fixed Validation Set
  • Cross-Validation


SLIDE 21

Labeled dataset


[Figure: labeled dataset as an N x F feature matrix x and an N x 1 label vector y.]

Each row represents one example. Assume rows are arranged "uniformly at random" (order doesn't matter).

SLIDE 22

Split into train and test


[Figure: rows of x and y partitioned into train and test portions.]

SLIDE 23

Selection via Fixed Validation Set


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

[Figure: rows of x and y split into train, validation, and test portions.]
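The fit/evaluate/select/report steps can be sketched end to end with polynomial degree as the hyperparameter (assuming NumPy; the synthetic sine data and helper names are illustrative):

```python
import numpy as np

def fit_poly(x, y, degree):
    """Fit polynomial-feature linear regression in closed form."""
    Phi = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def mse(x, y, theta):
    """Mean squared error of the degree-(len(theta)-1) model on (x, y)."""
    Phi = np.vander(x, len(theta), increasing=True)
    return float(np.mean((y - Phi @ theta) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=60)
x_tr, y_tr = x[:30], y[:30]        # train
x_va, y_va = x[30:45], y[30:45]    # validation
x_te, y_te = x[45:], y[45:]        # test

# 1) fit each candidate on train, 2) evaluate each on validation
val_err = {d: mse(x_va, y_va, fit_poly(x_tr, y_tr, d)) for d in range(10)}
best_degree = min(val_err, key=val_err.get)     # 3) lowest validation error
test_err = mse(x_te, y_te, fit_poly(x_tr, y_tr, best_degree))  # 4) report on test
```

Note that the test set is touched only once, after the degree has been chosen.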

SLIDE 24


Option: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

[Figure: rows of x and y split into train, validation, and test portions.]

Concerns:

  • What sizes to pick?
  • Will train be too small?
  • Is validation set used effectively? (only to evaluate predictions?)

SLIDE 25

For small datasets, randomness in validation split will impact selection

Credit: ISL Textbook, Chapter 5

[Figure: validation error curves for a single random split and for 10 other random splits.]

slide-26
SLIDE 26

3-fold Cross Validation


[Figure: three copies of the dataset rows, each with a different fold (fold 1, fold 2, fold 3) held out as validation and the rest kept as train.]

Divide the labeled dataset into 3 even-sized parts. Fit the model 3 independent times; each time, leave one fold out as validation and keep the remaining folds as training.

Heldout error estimate: average of the validation error across all 3 fits.
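The 3-fold recipe can be sketched directly (assuming NumPy; the helper name, synthetic data, and polynomial-feature model are illustrative):

```python
import numpy as np

def kfold_cv_error(x, y, degree, K=3):
    """Average validation MSE over K folds for a degree-`degree` fit."""
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi_tr = np.vander(x[train_idx], degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(Phi_tr, y[train_idx], rcond=None)
        Phi_va = np.vander(x[valid_idx], degree + 1, increasing=True)
        errs.append(np.mean((y[valid_idx] - Phi_va @ theta) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
y = x - 0.5 * x**2 + 0.05 * rng.normal(size=30)
err_deg2 = kfold_cv_error(x, y, degree=2)   # matches the true signal
err_deg0 = kfold_cv_error(x, y, degree=0)   # constant model, underfits
```

On this data the degree-2 model's heldout estimate beats the constant model's, as expected.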

SLIDE 27

K-fold CV: How many folds K?

  • Can do as low as 2 folds
  • Can do as high as N folds ("leave-one-out")
  • Usual rule of thumb: 5-fold or 10-fold CV
  • Computation runtime scales linearly with K
  • Larger K also means each fit uses more training data, so each fit might take longer too
  • Each fit is independent and parallelizable
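Because the K fits are independent, they can run concurrently; here is a sketch using the standard library's `concurrent.futures` (NumPy assumed; the degree-2 model, 5 folds, and synthetic data are illustrative choices):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(args):
    """One independent job: fit on the training folds, score the held-out fold."""
    x_tr, y_tr, x_va, y_va = args
    Phi_tr = np.vander(x_tr, 3, increasing=True)    # degree-2 features
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    Phi_va = np.vander(x_va, 3, increasing=True)
    return float(np.mean((y_va - Phi_va @ theta) ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 60)
y = 1.0 + x**2 + 0.1 * rng.normal(size=60)

folds = np.array_split(np.arange(60), 5)            # 5-fold CV
jobs = []
for k in range(5):
    va = folds[k]
    tr = np.concatenate([folds[j] for j in range(5) if j != k])
    jobs.append((x[tr], y[tr], x[va], y[va]))

with ThreadPoolExecutor() as pool:                  # the 5 fits run concurrently
    fold_errors = list(pool.map(fit_and_score, jobs))
cv_error = float(np.mean(fold_errors))
```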


SLIDE 28


Estimating Heldout Error with Cross Validation

[Figure: CV error curves from 9 separate splits, each with 10 folds, and from leave-one-out CV.]

Credit: ISL Textbook, Chapter 5