U7 - L1: Multiple Linear Regression
Statistics 101, Nicole Dalzell, June 15, 2015
Outline
1. Announcements Recap
2. SLR: Categorical Predictors
3. Many variables in a model
4. Adjusted R2
5. Collinearity and parsimony
Announcements
- Office hours today from 2-3 PM in Old Chem 211A.
- Problem Set 8 due tomorrow.
- Lab due tomorrow by 5 PM on Sakai.
Statistics 101 (Nicole Dalzell) U7 - L1: Multiple Linear Regression June 15, 2015 2 / 30
Announcements Recap
% College graduate vs. % Hispanic in LA
What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?
[Maps of LA zip code areas shaded by Education: College graduate and by Race/Ethnicity: Hispanic, each on a 0.0 to 1.0 scale, with freeways and no-data areas marked.]
% College educated vs. % Hispanic in LA

[Scatterplot of % college graduate (0% to 100%) against % Hispanic (0% to 100%) for the 100 zip code areas.]
% College educated vs. % Hispanic in LA - linear model

Participation question: Which of the below is the best interpretation of the slope?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7290      0.0308    23.68    0.0000
%Hispanic     -0.7527      0.0501   -15.01    0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

Answer: (b). The slope describes an association; the causal wording in (c) is not justified for observational data.
% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7290      0.0308    23.68    0.0000
hispanic      -0.7527      0.0501   -15.01    0.0000

Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different from 0.

How reliable is this p-value if these zip code areas are not randomly selected? Not very...
Recap

Inference for the slope for a SLR model (only one explanatory variable):

    Hypothesis test:      T = (b1 − null value) / SE_b1,  df = n − 2
    Confidence interval:  b1 ± t⋆(df = n−2) × SE_b1

- The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.
- The regression output gives b1, SE_b1, and the two-tailed p-value for the t-test for the slope where the null value is 0.
- We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.
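As a quick numeric check, the t statistic and a 95% confidence interval can be reproduced from the LA regression output (b1 = −0.7527, SE = 0.0501, n = 100). A minimal sketch; the critical value t⋆ ≈ 1.98 for df = 98 is supplied here, not taken from the slides:

```python
# Reproduce slope inference from the summary-table numbers (LA example).
b1 = -0.7527      # estimated slope for %Hispanic
se_b1 = 0.0501    # standard error of the slope
n = 100           # zip code areas, so df = n - 2 = 98

# Hypothesis test against null value 0: T = (b1 - 0) / SE_b1
t_stat = (b1 - 0) / se_b1   # about -15.02 (the slides report -15.01)

# 95% confidence interval: b1 +/- t* x SE_b1
t_star = 1.98               # assumed critical value for df = 98, not from the slides
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
```

The interval lies entirely below 0, which matches the hypothesis-test conclusion that the slope differs from 0.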
Caution

- Always be aware of the type of data you're working with: random sample, non-random sample, or population.
- Statistical inference, and the resulting p-values, are meaningless when you already have population data.
- If you have a sample that is non-random (biased), the results will be unreliable.
- The ultimate goal is to have independent observations, and you know how to check for those by now.
SLR: Categorical Predictors
Dinosaur Weight

What relationship do you see between the weight of dinosaurs and the type of dinosaur?

[Side-by-side boxplots, "Dinosaur Weight by Type": weight (kg), 0e+00 to 8e+04, for Ornithischian and Saurischian dinosaurs.]
                      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               2786        4422    0.630    0.5316
dino$TypeSaurischian     13652        5968    2.288    0.0265

Weight = 2786 + 13652 × TypeSaurischian

- Type of dinosaur is a categorical variable with two levels: Ornithischian and Saurischian.
- For Ornithischian dinosaurs: plug in 0 for TypeSaurischian.
- Slope b1: The model predicts that Saurischian dinosaurs weigh 13,652 kilograms more than Ornithischian dinosaurs, on average.
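The dummy-coding arithmetic can be sketched in a few lines (the function name is illustrative, not from the slides): the indicator TypeSaurischian is 0 for the reference level and 1 for Saurischian, so the two predicted weights are b0 and b0 + b1.

```python
# Dummy coding for a two-level categorical predictor (dinosaur example).
b0, b1 = 2786, 13652   # intercept and slope from the summary table

def predicted_weight(dino_type):
    # Ornithischian is the reference level, so its indicator is 0.
    type_saurischian = 1 if dino_type == "Saurischian" else 0
    return b0 + b1 * type_saurischian

ornith = predicted_weight("Ornithischian")   # 2786 kg: just the intercept
saur = predicted_weight("Saurischian")       # 2786 + 13652 = 16438 kg
```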
Dinosaurs!
Many variables in a model
Weights of books

     weight (g)  volume (cm3)  cover
 1        800          885      hc
 2        950         1016      hc
 3       1050         1125      hc
 4        350          239      hc
 5        750          701      hc
 6        600          641      hc
 7       1075         1228      hc
 8        250          412      pb
 9        700          953      pb
10        650          929      pb
11        975         1492      pb
12        350          419      pb
13        950         1010      pb
14        425          595      pb
15        725         1034      pb

[Diagram: a book with its width (w), length (l), and height (h) labeled.]
Weights of hard cover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

[Scatterplot of weight (g, 400 to 1000) against volume (cm3, 200 to 1400), with hardcover and paperback books distinguished.]

Paperbacks generally weigh less than hardcover books.
Modeling weights of books using volume and cover type

# load data
library(DAAG)
data(allbacks)
# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.96284    59.19274    3.344  0.005841 **
volume         0.71795     0.06153   11.669   6.6e-08 ***
cover:pb    -184.04727    40.49420   -4.545  0.000672 ***

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
Linear model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

weight-hat = 197.96 + 0.72 volume − 184.05 cover:pb

1. For hardcover books: plug in 0 for cover:
   weight-hat = 197.96 + 0.72 volume − 184.05 × 0 = 197.96 + 0.72 volume
2. For paperback books: plug in 1 for cover:
   weight-hat = 197.96 + 0.72 volume − 184.05 × 1 = 13.91 + 0.72 volume
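The two parallel fitted lines can be sketched as a single function of volume and cover type; the helper name below is illustrative, not from the slides:

```python
# Parallel-lines model for book weight (rounded coefficients from the table).
b0, b_volume, b_cover_pb = 197.96, 0.72, -184.05

def predicted_weight(volume, cover):
    pb = 1 if cover == "pb" else 0   # hardcover is the reference level
    return b0 + b_volume * volume + b_cover_pb * pb

# Hardcover line: 197.96 + 0.72 volume; paperback line: 13.91 + 0.72 volume.
hc_intercept = predicted_weight(0, "hc")   # 197.96
pb_intercept = predicted_weight(0, "pb")   # 197.96 - 184.05 = 13.91
```

At any given volume, the two predictions differ by exactly 184.05 g, which is what "parallel lines" means here.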
Visualising the linear model

[Scatterplot of weight (g) against volume (cm3) with the two parallel fitted lines, one for hardcover and one for paperback books.]
Interpretation of the regression coefficients

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

- Slope of volume: All else held constant, for each 1 cm3 increase in volume we would expect weight to increase on average by 0.72 grams.
- Slope of cover: All else held constant, the model predicts that paperback books weigh 184 grams less than hardcover books, on average.
- Intercept: Hardcover books with no volume are expected on average to weigh 198 grams. Obviously, the intercept does not make sense in context; it only serves to adjust the height of the line.
Prediction

Participation question: Which of the following is the correct calculation for the predicted weight of a paperback book that is 600 cm3?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

(a) 197.96 + 0.72 × 600 − 184.05 × 1 = 445.91 grams
(b) 184.05 + 0.72 × 600 − 197.96 × 1
(c) 197.96 + 0.72 × 600 − 184.05 × 0
(d) 197.96 + 0.72 × 1 − 184.05 × 600

Answer: (a).
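Option (a) checks out numerically; option (c) is the same calculation for a hardcover book, which is why the cover indicator matters:

```python
# Predicted weight of a 600 cm^3 book, using the rounded coefficients.
pred_pb = 197.96 + 0.72 * 600 - 184.05 * 1   # option (a): paperback
pred_hc = 197.96 + 0.72 * 600 - 184.05 * 0   # option (c): hardcover, 184.05 g heavier
```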
A note on "interaction" variables

weight-hat = 197.96 + 0.72 volume − 184.05 cover:pb

[Scatterplot of weight (g) against volume (cm3) with the two parallel fitted lines for hardcover and paperback books.]

This model assumes that hardcover and paperback books have the same slope for the relationship between their volume and weight. If this isn't reasonable, then we would include an "interaction" variable in the model (beyond the scope of this course).
Adjusted R2
Revisit: Modeling poverty

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house. Pairwise correlations: poverty & metro_res −0.20, poverty & white −0.31, poverty & hs_grad −0.75, poverty & female_house 0.53; metro_res & white −0.34, metro_res & hs_grad 0.018, metro_res & female_house 0.30; white & hs_grad 0.24, white & female_house −0.75; hs_grad & female_house −0.61.]
Predicting poverty using % female householder

# load data
poverty = read.csv("http://stat.duke.edu/~mc301/data/poverty.csv")
# fit model
pov_slr = lm(poverty ~ female_house, data = poverty)
summary(pov_slr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.31        1.90     1.74      0.09
female_house      0.69        0.16     4.32      0.00

[Scatterplot of % in poverty against % female householder with the fitted line.]

R = 0.53, R2 = 0.53^2 = 0.28
Another look at R2 - from last time

anova(pov_slr)

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.68    0.00
Residuals       49  347.68     7.10
Total           50  480.25

- SS of y: SSTot = Σ(y − ȳ)² = 480.25 → total variability
- SS of residuals: SSErr = Σ eᵢ² = 347.68 → unexplained variability
- SS of regression: SSReg = SSTot − SSErr = 480.25 − 347.68 = 132.57 → explained variability
- R2 = explained variability / total variability = 132.57 / 480.25 = 0.28
Predicting poverty using % female hh + % white

pov_mlr = lm(poverty ~ female_house + white, data = poverty)
summary(pov_mlr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -2.58        5.78    -0.45      0.66
female_house      0.89        0.24     3.67      0.00
white             0.04        0.04     1.08      0.29

anova(pov_mlr)

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.74    0.00
white            1    8.21     8.21     1.16    0.29
Residuals       48  339.47     7.07
Total           50  480.25

R2 = explained variability / total variability = (132.57 + 8.21) / 480.25 = 0.29
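Both R2 values follow from the ANOVA sums of squares with a couple of lines of arithmetic:

```python
# R^2 = explained / total variability, using the ANOVA sums of squares.
ss_total = 480.25
r2_slr = 132.57 / ss_total             # female_house only: about 0.28
r2_mlr = (132.57 + 8.21) / ss_total    # female_house + white: about 0.29
```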
Adjusted R2

    R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)

where n is the number of cases and k is the number of predictors (explanatory variables) in the model.
Application exercise: Adjusted R2

Calculate adjusted R2 for the multiple linear regression model predicting % living in poverty from % female householders and % white. Remember n = 51 (50 states + DC).

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.74   0.0001
white            1    8.21     8.21     1.16   0.2868
Residuals       48  339.47     7.07
Total           50  480.25

(a) 0.26 (b) 0.29 (c) 0.32 (d) 0.71

R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)
       = 1 − (339.47 / 480.25) × (51 − 1) / (51 − 2 − 1)
       = 1 − (339.47 / 480.25) × (50 / 48)
       = 1 − 0.74
       = 0.26

Answer: (a).
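The same computation as a small reusable function (the name adj_r2 is mine, not the deck's):

```python
def adj_r2(ss_error, ss_total, n, k):
    """Adjusted R^2 for n cases and k predictors."""
    return 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

# Poverty MLR: n = 51 (50 states + DC), k = 2 predictors.
result = adj_r2(339.47, 480.25, n=51, k=2)   # about 0.26, answer (a)
```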
R2 vs. adjusted R2

                                              R2    Adjusted R2
Model 1 (poverty vs. female_house)           0.28       0.26
Model 2 (poverty vs. female_house + white)   0.29       0.26

- When any variable is added to the model, R2 increases.
- But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R2 does not increase.
Adjusted R2 - properties

    R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)

- Because (n − 1)/(n − k − 1) ≥ 1 whenever k ≥ 0, R2_adj is never larger than R2.
- R2_adj applies a penalty for the number of predictors included in the model. Therefore, we choose models with higher R2_adj over others.
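Using the deck's own poverty numbers, adding white raises R2 slightly but leaves adjusted R2 essentially flat, which is exactly the pattern in the comparison table:

```python
# Compare R^2 and adjusted R^2 for the two poverty models (deck's numbers).
ss_total = 480.25
ss_err_1, k1 = 347.68, 1   # Model 1: female_house
ss_err_2, k2 = 339.47, 2   # Model 2: female_house + white
n = 51

r2_1 = 1 - ss_err_1 / ss_total   # 0.28
r2_2 = 1 - ss_err_2 / ss_total   # 0.29
adj_1 = 1 - (ss_err_1 / ss_total) * (n - 1) / (n - k1 - 1)
adj_2 = 1 - (ss_err_2 / ss_total) * (n - 1) / (n - k2 - 1)
# R^2 rises (0.28 -> 0.29) while both adjusted values round to 0.26.
```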
Participation question: True or false: Adjusted R2 tells us the percentage of variability in the response variable explained by the model.

(a) True
(b) False

Answer: (b) False. That interpretation belongs to R2; adjusted R2 includes a penalty for the number of predictors, so it cannot be read as a percentage of variability explained.
Collinearity and parsimony
We saw that adding the variable white to the model did not increase adjusted R2, i.e. it did not add any valuable information to the model. Why?

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house; note the −0.75 correlation between white and female_house.]
Collinearity between explanatory variables

- Two predictor variables are said to be collinear when they are correlated, and this collinearity (also called multicollinearity) complicates model estimation. Remember: predictors are also called explanatory or independent variables, so they should be independent of each other.
- We don't like adding predictors that are associated with each other to the model, because often the addition of such a variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. the parsimonious model.
- In addition, adding collinear variables can make the estimates of the slope parameters unstable and hard to interpret.
- While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors.
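One standard way to quantify this instability, which goes beyond these slides, is the variance inflation factor (VIF). For a model with exactly two predictors, the VIF for each slope is 1 / (1 − r²), where r is the correlation between the predictors; r = −0.75 for female_house and white is read off the deck's scatterplot matrix. A minimal sketch:

```python
import math

# Variance inflation factor (VIF): a standard collinearity diagnostic.
# With two predictors, VIF = 1 / (1 - r^2), where r is the correlation
# between the two predictors.
r = -0.75                      # corr(female_house, white) from the scatterplot matrix
vif = 1 / (1 - r ** 2)         # about 2.29
se_inflation = math.sqrt(vif)  # each slope's SE is about 1.5x what it would
                               # be if the predictors were uncorrelated
```

So even moderate collinearity like r = −0.75 noticeably widens the standard errors, which is one concrete reason adding white did not help the model.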