Lecture 4: Introduction to Regression
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
CS109A, PROTOPAPAS, RADER, TANNER
Background
Roadmap:
- Lecture 1: What is Data Science?
- Lecture 2: Data: types, formats, issues, etc., and a brief look at visualization
- Lecture 3 and Lab 2: How to quickly prepare data and scrape the web
- This lecture: How to model data and evaluate model fitness
- Next 3 lectures: Linear regression, confidence intervals, model selection, cross validation, regularization
Lecture Outline
- Statistical Modeling
- k-Nearest Neighbors (kNN)
- Model Fitness: how well does the model predict?
- Comparison of Two Models: how do we choose between two different models?
Predicting a Variable
Let's imagine a scenario where we'd like to predict one variable using another (or a set of other) variables. Examples:
- Predicting the number of views a YouTube video will get next week based on video length, the date it was posted, previous number of views, etc.
- Predicting which movies a Netflix user will rate highly based on their previous movie ratings, demographic data, etc.
Data
TV     radio  newspaper  sales
230.1  37.8   69.2       22.1
44.5   39.3   45.1       10.4
17.2   45.9   69.3       9.3
151.5  41.3   58.5       18.5
180.8  10.8   58.4       12.9
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Everything is given in units of $1,000.
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
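The example rows above can be put into a pandas DataFrame for analysis. A minimal sketch using only the five rows shown (the full data set has 200 markets; how you would load the actual file depends on your setup):

```python
import pandas as pd

# The five example rows of the Advertising data shown above.
# All values are in units of $1,000; the full data set has 200 markets.
df = pd.DataFrame({
    "TV":        [230.1, 44.5, 17.2, 151.5, 180.8],
    "radio":     [37.8, 39.3, 45.9, 41.3, 10.8],
    "newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "sales":     [22.1, 10.4, 9.3, 18.5, 12.9],
})
print(df.shape)  # (5, 4)
```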
Response vs. Predictor Variables
There is an asymmetry in many of these problems: the variable we'd like to predict may be more difficult to measure, may be more important than the other(s), or may be directly or indirectly influenced by the values of the other variable(s).

Thus, we'd like to define two categories of variables:
- variables whose value we want to predict
- variables whose values we use to make our prediction
Response vs. Predictor Variables
In the Advertising data, sales is the outcome: the response variable, or dependent variable, denoted Y. TV, radio, and newspaper are the predictors, also called features or covariates, denoted X. The table has p predictors and n observations.
In notation:
- Outcome / response variable / dependent variable: Y = y_1, …, y_n
- Predictors / features / covariates: X = X_1, …, X_p, where X_j = x_{1,j}, …, x_{i,j}, …, x_{n,j}
- p predictors, n observations
Definition
We are observing p + 1 variables, and we are making n sets of observations. We call:
- the variable we'd like to predict the outcome or response variable; typically, we denote this variable by Y and the individual measurements by y_i;
- the variables we use in making the predictions the features or predictor variables; typically, we denote these variables by X = X_1, …, X_p and the individual measurements by x_{i,j}.

Note: i indexes the observation (i = 1, …, n) and j indexes the predictor variable (j = 1, …, p).
Statistical Model
True vs. Statistical Model
We will assume that the response variable, Y, relates to the predictors, X, through some unknown function, expressed generally as:

Y = f(X) + ε

Here, f is the unknown function expressing an underlying rule for relating Y to X, and ε is a random amount (unrelated to X) by which Y differs from the rule f(X).

A statistical model is any algorithm that estimates f. We denote the estimated function by f̂.
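A quick way to see the Y = f(X) + ε setup is to simulate data from a known rule. The f below is a hypothetical choice purely for illustration; in real problems f is unknown:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A hypothetical "true" rule; in real problems f is unknown.
    return 2.0 * x + 1.0

n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, 1.0, size=n)  # random noise, unrelated to x
y = f(x) + eps                      # observed responses
```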
Statistical Model
[Figure: scatter plot of the data, y vs. x]
What is the value of y at this x?
How do we find f̂(x)?
How do we find f̂(x)? Or at this other x?
A simple idea is to take the mean of all the y's:

f̂(x) = (1/n) Σ_{i=1}^{n} y_i
Prediction vs. Estimation
For some problems, what's important is obtaining f̂, our estimate of f. These are called inference problems.

When we use a set of measurements (x_{i,1}, …, x_{i,p}) to predict a value for the response variable, we denote the predicted value by ŷ_i = f̂(x_{i,1}, …, x_{i,p}).

For some problems, we don't care about the specific form of f̂; we just want to make our predictions ŷ as close to the observed values y as possible. These are called prediction problems.
Simple Prediction Model
What is ŷ_q at some point x_q?

1. Find the distances to all other points, D(x_q, x_i).
2. Find the nearest neighbor, (x_p, y_p).
3. Predict ŷ_q = y_p.
Do the same for "all" x's.
Extend the Prediction Model
What is ŷ_q at some point x_q?

1. Find the distances to all other points, D(x_q, x_i).
2. Find the k nearest neighbors, x_{q_1}, …, x_{q_k}.
3. Predict ŷ_q = (1/k) Σ_{i=1}^{k} y_{q_i}.
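The steps above can be sketched as a small function. This is a minimal one-dimensional version, using absolute difference as the distance D:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict y at x_query as the mean response of the k nearest training points."""
    dists = np.abs(x_train - x_query)   # distances D(x_query, x_i)
    nearest = np.argsort(dists)[:k]     # indices of the k nearest neighbors
    return y_train[nearest].mean()      # average their responses

x_train = np.array([1.0, 2.0, 3.0, 10.0])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
knn_predict(x_train, y_train, 2.2, k=1)  # nearest neighbor is x=2.0 -> 2.0
knn_predict(x_train, y_train, 2.2, k=3)  # mean of y at x=1,2,3 -> 2.0
```

With k = 1 this reduces to the single-nearest-neighbor rule from the previous slide.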
Simple Prediction Models
We can try models with different values of k on more data.
k-Nearest Neighbors
The k-Nearest Neighbors (kNN) model is an intuitive way to predict a quantitative response variable: to predict a response for a set of observed predictor values, we use the responses of the other observations most similar to it.

Note: this strategy can also be applied in classification to predict a categorical variable. We will encounter kNN again later in the course, in the context of classification.
k-Nearest Neighbors - kNN

For a fixed value of k, the predicted response for the n-th observation is the average of the observed responses of the k closest observations:

ŷ_n = (1/k) Σ_{i=1}^{k} y_{n_i}

where (x_{n_1}, y_{n_1}), …, (x_{n_k}, y_{n_k}) are the k observations most similar to x_n ("similar" refers to a notion of distance between predictors).
ED quiz: Lecture 4 | part 1
Things to Consider
- Model Fitness: how well does the model predict?
- Comparison of Two Models: how do we choose between two different models?
- Evaluating Significance of Predictors: does the outcome depend on the predictors?
- How well do we know f̂? The confidence intervals of our f̂.
Error Evaluation
Start with some data.
Hide some of the data from the model; this is called a train-test split. We use the train set to estimate ŷ, and the test set to evaluate the model.
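A train-test split can be done by shuffling the observation indices and holding some back. A minimal sketch with simulated data and a 70/30 split (both the data and the split fraction are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(0.0, 1.0, size=20)

# Shuffle the observation indices, then hide 30% of the points from the model.
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test_idx, train_idx = idx[:n_test], idx[n_test:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]
```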
Estimate ŷ for k = 1.
Now, we look at the data we have not used, the test data (red crosses).
Calculate the residuals (y_i − ŷ_i).
Do the same for k=3.
In order to quantify how well a model performs, we define a loss or error function. A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

The quantity y_i − ŷ_i is called a residual and measures the error at the i-th prediction.
Caution: the MSE is by no means the only valid (or the best) loss function!

Question: what would be an intuitive loss function for predicting categorical outcomes?

Note: the square root of the mean of the squared errors (RMSE) is also commonly used:

RMSE = √MSE = √[(1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²]
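Both error measures translate directly into code; a minimal sketch:

```python
import numpy as np

def mse(y, y_hat):
    # Mean of the squared residuals (y_i - yhat_i)^2.
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Square root of the MSE; same units as y.
    return np.sqrt(mse(y, y_hat))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 3.0, 5.0])
mse(y, y_hat)  # (0 + 1 + 4) / 3
```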
Model Comparison
Do the same for all k's and compare the RMSEs. k = 3 seems to be the best model.
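This comparison can be sketched as a loop over candidate k's, scoring each on held-out data. The data below are simulated for illustration; which k wins depends on the data at hand:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    # Average the responses of the k nearest training points.
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[nearest].mean()

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Simulated data, split 70/30 into train and test.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)
x_train, y_train = x[:70], y[:70]
x_test, y_test = x[70:], y[70:]

# Score each candidate k by its RMSE on the test set.
scores = {}
for k in (1, 3, 5, 10, 30):
    y_hat = np.array([knn_predict(x_train, y_train, xq, k) for xq in x_test])
    scores[k] = rmse(y_test, y_hat)

best_k = min(scores, key=scores.get)
```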
Model Fitness
For a subset of the data, calculate the RMSE for k=3. Is RMSE=5.0 good enough?
What if we measure the sales in cents instead of dollars? The RMSE is now 5004.93. Is that good?
It is better if we compare it to something. We will use the simplest model, the mean of the observed responses:

ȳ = (1/n) Σ_{i=1}^{n} y_i
R-squared
R² = 1 − [Σ_i (ŷ_i − y_i)²] / [Σ_i (ȳ − y_i)²]

- If our model is only as good as the mean value ȳ, then R² = 0.
- If our model is perfect, then R² = 1.
- R² can be negative if the model is worse than the average. This can happen when we evaluate the model on the test set.
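The R² formula translates directly into code. A minimal sketch, checked against the two reference cases (the perfect model and the mean model):

```python
import numpy as np

def r_squared(y, y_hat):
    # 1 minus the model's squared error over the mean model's squared error.
    ss_model = np.sum((y_hat - y) ** 2)
    ss_mean = np.sum((y.mean() - y) ** 2)
    return 1.0 - ss_model / ss_mean

y = np.array([1.0, 2.0, 3.0, 4.0])
r_squared(y, y)                          # perfect model -> 1.0
r_squared(y, np.full_like(y, y.mean()))  # mean model -> 0.0
```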
Summary

- Model Fitness: how well does the model predict?
- Comparison of Two Models: how do we choose between two different models?
- Evaluating Significance of Predictors: does the outcome depend on the predictors?
- How well do we know f̂? The confidence intervals of our f̂.