U7 - L1: Multiple Linear Regression
Statistics 101, Nicole Dalzell, June 15, 2015
Outline
1. Announcements Recap
2. SLR: Categorical Predictors
3. Many variables in a model
4. Adjusted R2
5. Collinearity and parsimony
Announcements
- Office hours today from 2-3 PM in Old Chem 211A.
- Problem Set 8 due tomorrow.
- Lab due tomorrow by 5 PM on Sakai.
Statistics 101 (Nicole Dalzell) U7 - L1: Multiple Linear Regression June 15, 2015 2 / 30
Announcements Recap
% College graduate vs. % Hispanic in LA
What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?
[Maps of LA zip code areas shaded by Education: College graduate and by Race/Ethnicity: Hispanic, each on a 0.0 to 1.0 scale, with freeways and no-data areas marked.]
% College educated vs. % Hispanic in LA

[Scatterplot of % college graduate (0% to 100%) against % Hispanic (0% to 100%) for the 100 zip code areas.]
% College educated vs. % Hispanic in LA - linear model

Participation question: Which of the below is the best interpretation of the slope?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7290      0.0308    23.68    0.0000
%Hispanic     -0.7527      0.0501   -15.01    0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

Answer: (b). The slope describes an association; the causal wording in (c) is not justified for observational data.
% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7290      0.0308    23.68    0.0000
hispanic      -0.7527      0.0501   -15.01    0.0000

Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different from 0.

How reliable is this p-value if these zip code areas are not randomly selected? Not very...
Recap

Inference for the slope for a SLR model (only one explanatory variable):

    Hypothesis test:      T = (b1 − null value) / SE_b1,  df = n − 2
    Confidence interval:  b1 ± t⋆(df = n−2) × SE_b1

- The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.
- The regression output gives b1, SE_b1, and the two-tailed p-value for the t-test for the slope where the null value is 0.
- We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.
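As a quick numeric check, the t statistic and a 95% confidence interval can be reproduced from the LA regression output (b1 = −0.7527, SE = 0.0501, n = 100). A minimal sketch; the critical value t⋆ ≈ 1.98 for df = 98 is supplied here, not taken from the slides:

```python
# Reproduce slope inference from the summary-table numbers (LA example).
b1 = -0.7527      # estimated slope for %Hispanic
se_b1 = 0.0501    # standard error of the slope
n = 100           # zip code areas, so df = n - 2 = 98

# Hypothesis test against null value 0: T = (b1 - 0) / SE_b1
t_stat = (b1 - 0) / se_b1   # about -15.02 (the slides report -15.01)

# 95% confidence interval: b1 +/- t* x SE_b1
t_star = 1.98               # assumed critical value for df = 98, not from the slides
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
```

The interval lies entirely below 0, which matches the hypothesis-test conclusion that the slope differs from 0.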
Caution

- Always be aware of the type of data you're working with: random sample, non-random sample, or population.
- Statistical inference, and the resulting p-values, are meaningless when you already have population data.
- If you have a sample that is non-random (biased), the results will be unreliable.
- The ultimate goal is to have independent observations, and you know how to check for those by now.
SLR: Categorical Predictors
Dinosaur Weight

What relationship do you see between the weight of dinosaurs and the type of dinosaur?

[Side-by-side boxplots, "Dinosaur Weight by Type": weight (kg), 0e+00 to 8e+04, for Ornithischian and Saurischian dinosaurs.]
                      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               2786        4422    0.630    0.5316
dino$TypeSaurischian     13652        5968    2.288    0.0265

Weight = 2786 + 13652 × TypeSaurischian

- Type of dinosaur is a categorical variable with two levels: Ornithischian and Saurischian.
- For Ornithischian dinosaurs: plug in 0 for TypeSaurischian.
- Slope b1: The model predicts that Saurischian dinosaurs weigh 13,652 kilograms more than Ornithischian dinosaurs, on average.
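The dummy-coding arithmetic can be sketched in a few lines (the function name is illustrative, not from the slides): the indicator TypeSaurischian is 0 for the reference level and 1 for Saurischian, so the two predicted weights are b0 and b0 + b1.

```python
# Dummy coding for a two-level categorical predictor (dinosaur example).
b0, b1 = 2786, 13652   # intercept and slope from the summary table

def predicted_weight(dino_type):
    # Ornithischian is the reference level, so its indicator is 0.
    type_saurischian = 1 if dino_type == "Saurischian" else 0
    return b0 + b1 * type_saurischian

ornith = predicted_weight("Ornithischian")   # 2786 kg: just the intercept
saur = predicted_weight("Saurischian")       # 2786 + 13652 = 16438 kg
```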
Dinosaurs!
Many variables in a model
Weights of books

     weight (g)  volume (cm3)  cover
 1        800          885      hc
 2        950         1016      hc
 3       1050         1125      hc
 4        350          239      hc
 5        750          701      hc
 6        600          641      hc
 7       1075         1228      hc
 8        250          412      pb
 9        700          953      pb
10        650          929      pb
11        975         1492      pb
12        350          419      pb
13        950         1010      pb
14        425          595      pb
15        725         1034      pb

[Diagram: a book with its width (w), length (l), and height (h) labeled.]
Weights of hard cover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

[Scatterplot of weight (g, 400 to 1000) against volume (cm3, 200 to 1400), with hardcover and paperback books distinguished.]

Paperbacks generally weigh less than hardcover books.
Modeling weights of books using volume and cover type

# load data
library(DAAG)
data(allbacks)
# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.96284    59.19274    3.344  0.005841 **
volume         0.71795     0.06153   11.669   6.6e-08 ***
cover:pb    -184.04727    40.49420   -4.545  0.000672 ***

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
Linear model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

weight-hat = 197.96 + 0.72 volume − 184.05 cover:pb

1. For hardcover books: plug in 0 for cover:
   weight-hat = 197.96 + 0.72 volume − 184.05 × 0 = 197.96 + 0.72 volume
2. For paperback books: plug in 1 for cover:
   weight-hat = 197.96 + 0.72 volume − 184.05 × 1 = 13.91 + 0.72 volume
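The two parallel fitted lines can be sketched as a single function of volume and cover type; the helper name below is illustrative, not from the slides:

```python
# Parallel-lines model for book weight (rounded coefficients from the table).
b0, b_volume, b_cover_pb = 197.96, 0.72, -184.05

def predicted_weight(volume, cover):
    pb = 1 if cover == "pb" else 0   # hardcover is the reference level
    return b0 + b_volume * volume + b_cover_pb * pb

# Hardcover line: 197.96 + 0.72 volume; paperback line: 13.91 + 0.72 volume.
hc_intercept = predicted_weight(0, "hc")   # 197.96
pb_intercept = predicted_weight(0, "pb")   # 197.96 - 184.05 = 13.91
```

At any given volume, the two predictions differ by exactly 184.05 g, which is what "parallel lines" means here.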
Visualising the linear model

[Scatterplot of weight (g) against volume (cm3) with the two parallel fitted lines, one for hardcover and one for paperback books.]
Interpretation of the regression coefficients

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

- Slope of volume: All else held constant, for each 1 cm3 increase in volume we would expect weight to increase on average by 0.72 grams.
- Slope of cover: All else held constant, the model predicts that paperback books weigh 184 grams less than hardcover books, on average.
- Intercept: Hardcover books with no volume are expected on average to weigh 198 grams. Obviously, the intercept does not make sense in context; it only serves to adjust the height of the line.
Prediction

Participation question: Which of the following is the correct calculation for the predicted weight of a paperback book that is 600 cm3?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

(a) 197.96 + 0.72 × 600 − 184.05 × 1 = 445.91 grams
(b) 184.05 + 0.72 × 600 − 197.96 × 1
(c) 197.96 + 0.72 × 600 − 184.05 × 0
(d) 197.96 + 0.72 × 1 − 184.05 × 600

Answer: (a).
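Option (a) checks out numerically; option (c) is the same calculation for a hardcover book, which is why the cover indicator matters:

```python
# Predicted weight of a 600 cm^3 book, using the rounded coefficients.
pred_pb = 197.96 + 0.72 * 600 - 184.05 * 1   # option (a): paperback
pred_hc = 197.96 + 0.72 * 600 - 184.05 * 0   # option (c): hardcover, 184.05 g heavier
```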
A note on "interaction" variables

weight-hat = 197.96 + 0.72 volume − 184.05 cover:pb

[Scatterplot of weight (g) against volume (cm3) with the two parallel fitted lines for hardcover and paperback books.]

This model assumes that hardcover and paperback books have the same slope for the relationship between their volume and weight. If this isn't reasonable, then we would include an "interaction" variable in the model (beyond the scope of this course).
Adjusted R2
Revisit: Modeling poverty

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house. Pairwise correlations: poverty & metro_res −0.20, poverty & white −0.31, poverty & hs_grad −0.75, poverty & female_house 0.53; metro_res & white −0.34, metro_res & hs_grad 0.018, metro_res & female_house 0.30; white & hs_grad 0.24, white & female_house −0.75; hs_grad & female_house −0.61.]
Predicting poverty using % female householder

# load data
poverty = read.csv("http://stat.duke.edu/~mc301/data/poverty.csv")
# fit model
pov_slr = lm(poverty ~ female_house, data = poverty)
summary(pov_slr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.31        1.90     1.74      0.09
female_house      0.69        0.16     4.32      0.00

[Scatterplot of % in poverty against % female householder with the fitted line.]

R = 0.53, R2 = 0.53^2 = 0.28
Another look at R2 - from last time

anova(pov_slr)

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.68    0.00
Residuals       49  347.68     7.10
Total           50  480.25

- SS of y: SSTot = Σ(y − ȳ)² = 480.25 → total variability
- SS of residuals: SSErr = Σ eᵢ² = 347.68 → unexplained variability
- SS of regression: SSReg = SSTot − SSErr = 480.25 − 347.68 = 132.57 → explained variability
- R2 = explained variability / total variability = 132.57 / 480.25 = 0.28
Predicting poverty using % female hh + % white

pov_mlr = lm(poverty ~ female_house + white, data = poverty)
summary(pov_mlr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -2.58        5.78    -0.45      0.66
female_house      0.89        0.24     3.67      0.00
white             0.04        0.04     1.08      0.29

anova(pov_mlr)

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.74    0.00
white            1    8.21     8.21     1.16    0.29
Residuals       48  339.47     7.07
Total           50  480.25

R2 = explained variability / total variability = (132.57 + 8.21) / 480.25 = 0.29
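Both R2 values follow from the ANOVA sums of squares with a couple of lines of arithmetic:

```python
# R^2 = explained / total variability, using the ANOVA sums of squares.
ss_total = 480.25
r2_slr = 132.57 / ss_total             # female_house only: about 0.28
r2_mlr = (132.57 + 8.21) / ss_total    # female_house + white: about 0.29
```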
Adjusted R2

    R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)

where n is the number of cases and k is the number of predictors (explanatory variables) in the model.
Application exercise: Adjusted R2

Calculate adjusted R2 for the multiple linear regression model predicting % living in poverty from % female householders and % white. Remember n = 51 (50 states + DC).

ANOVA:          Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house     1  132.57   132.57    18.74   0.0001
white            1    8.21     8.21     1.16   0.2868
Residuals       48  339.47     7.07
Total           50  480.25

(a) 0.26 (b) 0.29 (c) 0.32 (d) 0.71

R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)
       = 1 − (339.47 / 480.25) × (51 − 1) / (51 − 2 − 1)
       = 1 − (339.47 / 480.25) × (50 / 48)
       = 1 − 0.74
       = 0.26

Answer: (a).
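The same computation as a small reusable function (the name adj_r2 is mine, not the deck's):

```python
def adj_r2(ss_error, ss_total, n, k):
    """Adjusted R^2 for n cases and k predictors."""
    return 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

# Poverty MLR: n = 51 (50 states + DC), k = 2 predictors.
result = adj_r2(339.47, 480.25, n=51, k=2)   # about 0.26, answer (a)
```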
R2 vs. adjusted R2

                                              R2    Adjusted R2
Model 1 (poverty vs. female_house)           0.28       0.26
Model 2 (poverty vs. female_house + white)   0.29       0.26

- When any variable is added to the model, R2 increases.
- But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R2 does not increase.
Adjusted R2 - properties

    R2_adj = 1 − (SSError / SSTotal) × (n − 1) / (n − k − 1)

- Because (n − 1)/(n − k − 1) ≥ 1 whenever k ≥ 0, R2_adj is never larger than R2.
- R2_adj applies a penalty for the number of predictors included in the model. Therefore, we choose models with higher R2_adj over others.
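Using the deck's own poverty numbers, adding white raises R2 slightly but leaves adjusted R2 essentially flat, which is exactly the pattern in the comparison table:

```python
# Compare R^2 and adjusted R^2 for the two poverty models (deck's numbers).
ss_total = 480.25
ss_err_1, k1 = 347.68, 1   # Model 1: female_house
ss_err_2, k2 = 339.47, 2   # Model 2: female_house + white
n = 51

r2_1 = 1 - ss_err_1 / ss_total   # 0.28
r2_2 = 1 - ss_err_2 / ss_total   # 0.29
adj_1 = 1 - (ss_err_1 / ss_total) * (n - 1) / (n - k1 - 1)
adj_2 = 1 - (ss_err_2 / ss_total) * (n - 1) / (n - k2 - 1)
# R^2 rises (0.28 -> 0.29) while both adjusted values round to 0.26.
```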
Participation question: True or false: Adjusted R2 tells us the percentage of variability in the response variable explained by the model.

(a) True
(b) False

Answer: (b) False. That interpretation belongs to R2; adjusted R2 includes a penalty for the number of predictors, so it cannot be read as a percentage of variability explained.
Collinearity and parsimony
We saw that adding the variable white to the model did not increase adjusted R2, i.e. it did not add any valuable information to the model. Why?

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house; note the −0.75 correlation between white and female_house.]
Collinearity between explanatory variables

- Two predictor variables are said to be collinear when they are correlated, and this collinearity (also called multicollinearity) complicates model estimation. Remember: predictors are also called explanatory or independent variables, so they should be independent of each other.
- We don't like adding predictors that are associated with each other to the model, because often the addition of such a variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. the parsimonious model.
- In addition, adding collinear variables can make the estimates of the slope parameters unstable and hard to interpret.
- While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors.
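One standard way to quantify this instability, which goes beyond these slides, is the variance inflation factor (VIF). For a model with exactly two predictors, the VIF for each slope is 1 / (1 − r²), where r is the correlation between the predictors; r = −0.75 for female_house and white is read off the deck's scatterplot matrix. A minimal sketch:

```python
import math

# Variance inflation factor (VIF): a standard collinearity diagnostic.
# With two predictors, VIF = 1 / (1 - r^2), where r is the correlation
# between the two predictors.
r = -0.75                      # corr(female_house, white) from the scatterplot matrix
vif = 1 / (1 - r ** 2)         # about 2.29
se_inflation = math.sqrt(vif)  # each slope's SE is about 1.5x what it would
                               # be if the predictors were uncorrelated
```

So even moderate collinearity like r = −0.75 noticeably widens the standard errors, which is one concrete reason adding white did not help the model.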