- Day 5: Model Selection I
Lucas Leemann
Essex Summer School
Introduction to Statistical Learning
- L. Leemann (Essex Summer School)
Day 5 Introduction to SL 1 / 26
1 Motivation
2 Choosing the Optimal Model
3 Subset Selection
4
Red: test error. Blue: training error. (Hastie et al. 2008: 220)
[Figure, three panels: Cp, BIC, and adjusted R2, each plotted against the number of predictors.]
RSS/(n−d−1). While RSS
RSS/(n−d−1)
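The fragments above look like the remains of the adjusted R2 slide. As a hedged reconstruction, assuming the standard definition used in James et al. (2013), with n observations, d predictors, and TSS the total sum of squares (TSS does not appear in the surviving text and is an assumption here):

```latex
\[
\text{Adjusted } R^2 \;=\; 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}
\]
```

Because TSS/(n−1) does not depend on the model, maximizing adjusted R2 is the same as minimizing RSS/(n−d−1). RSS alone never increases when a predictor is added, whereas RSS/(n−d−1) can go up or down because d also appears in the denominator.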
[Figure, three panels: square root of BIC, validation set error, and cross-validation error, each plotted against the number of predictors.]
> regfit.full <- regsubsets(mpg ~ ., Auto[,-9])
> summary(regfit.full)
Subset selection object
Call: regsubsets.formula(mpg ~ ., Auto[, -9])
7 Variables  (and intercept)
             Forced in Forced out
cylinders        FALSE      FALSE
displacement     FALSE      FALSE
horsepower       FALSE      FALSE
weight           FALSE      FALSE
acceleration     FALSE      FALSE
year             FALSE      FALSE
origin           FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
         cylinders displacement horsepower weight acceleration year origin
1  ( 1 ) " "       " "          " "        "*"    " "          " "  " "
2  ( 1 ) " "       " "          " "        "*"    " "          "*"  " "
3  ( 1 ) " "       " "          " "        "*"    " "          "*"  "*"
4  ( 1 ) " "       "*"          " "        "*"    " "          "*"  "*"
5  ( 1 ) " "       "*"          "*"        "*"    " "          "*"  "*"
6  ( 1 ) "*"       "*"          "*"        "*"    " "          "*"  "*"
7  ( 1 ) "*"       "*"          "*"        "*"    "*"          "*"  "*"
[Figure, four panels: RSS, adjusted RSq, Cp, and BIC, each plotted against the number of variables.]
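The four criteria in this figure can be read off the summary of the regsubsets() fit shown earlier. The following is a minimal sketch, assuming the ISLR and leaps packages are installed; the names reg.summary, criteria, and best.size are illustrative and do not come from the slides.

```r
library(ISLR)   # provides the Auto data used on the slide
library(leaps)  # provides regsubsets()

# Refit the exhaustive search from the slide (column 9 is the car name).
regfit.full <- regsubsets(mpg ~ ., Auto[, -9])
reg.summary <- summary(regfit.full)

# One value per model size (1 to 7 predictors) for each criterion.
criteria <- data.frame(
  size  = seq_along(reg.summary$rss),
  rss   = reg.summary$rss,
  adjr2 = reg.summary$adjr2,
  cp    = reg.summary$cp,
  bic   = reg.summary$bic
)
print(criteria)

# Model size preferred by each criterion (illustrative helper vector).
best.size <- c(
  adjr2 = which.max(reg.summary$adjr2),
  cp    = which.min(reg.summary$cp),
  bic   = which.min(reg.summary$bic)
)
print(best.size)

# Coefficients of the model that BIC prefers.
print(coef(regfit.full, best.size[["bic"]]))
```

RSS is not used to pick a size here because it can only decrease as variables are added; the other three criteria penalize model size and can therefore disagree about the preferred model.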
(Here, "best" means the highest R2 or, equivalently, the smallest MSE, since the number of predictors k is constant within each step.)
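To make the note concrete, here is a minimal base-R sketch of forward stepwise selection, assuming that is the procedure this note refers to (the R code further down calls regsubsets() with method="forward"). The function forward_stepwise and the use of the built-in mtcars data are illustrative assumptions; neither comes from the slides.

```r
# Forward stepwise selection sketch in base R.
# `mtcars` ships with R and is only a stand-in for the slide data.
forward_stepwise <- function(data, response) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  path <- vector("list", length(candidates))

  for (k in seq_along(path)) {
    # Try adding each remaining predictor to the current model.
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(resid(lm(f, data = data))^2)
    })
    # Every candidate model at this step has exactly k predictors, so the
    # smallest RSS is the same choice as the highest R2 or the smallest MSE.
    best <- names(which.min(rss))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    path[[k]] <- list(size = k, added = best, rss = unname(rss[best]))
  }
  do.call(rbind, lapply(path, as.data.frame))
}

steps <- forward_stepwise(mtcars, "mpg")
print(steps)
```

The sketch only produces the sequence of nested models. Choosing among models of different sizes still requires a criterion that accounts for k, such as Cp, BIC, adjusted R2, or cross-validation error.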
(James et al. 2013: 209)
(Here, "best" means the highest R2 or, equivalently, the smallest MSE, since the number of predictors k is constant within each step.)
> regfit.full <- regsubsets(Salary ~ ., data=Hitters, nvmax=19)
> regfit.for <- regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="forward")
> regfit.back <- regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="backward")
>
> coef(regfit.full, 7)
(Intercept) Hits Walks CAtBat CHits CHmRun DivisionW PutOuts
79.4509472 1.2833513 3.2274264
1.4957073 1.4420538 -129.9866432 0.2366813
>
> coef(regfit.for, 7)
(Intercept) AtBat Hits Walks CRBI CWalks DivisionW PutOuts
109.7873062
7.4498772 4.9131401 0.8537622
0.2533404
>
> coef(regfit.back, 7)
(Intercept) AtBat Hits Walks CRuns CWalks DivisionW PutOuts
105.6487488
6.7574914 6.0558691 1.1293095
0.3028847
(James et al. 2013: 214)