Lecture 10: Regularized/penalized regression (cont'd)
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
2nd May 2019
A short recap
Goals of modelling
1. Predictive strength: How well can we reconstruct the observed data? Has been most important so far.
2. Model/variable selection: Which variables are part of the true model? This is about uncovering structure to allow for mechanistic understanding.
Feature selection
Feature selection can be addressed in multiple ways
▶ Filtering: Remove variables before the actual model for the data is built
  ▶ Often crude but fast
  ▶ Typically only pays attention to one or two features at a time (e.g. F-score, MIC) or does not take the outcome variable into consideration (e.g. PCA)
▶ Wrapping: Consider the selected features as an additional hyper-parameter
  ▶ Computationally very heavy
  ▶ Most approximations are greedy algorithms
▶ Embedding: Include feature selection into parameter estimation through penalisation of the model coefficients
  ▶ Naive form is equally computationally heavy as wrapping
  ▶ Soft constraints create biased but useful approximations
Penalised regression
The optimization problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_q^q \le c$$

for $q > 0$ is equivalent to

$$\hat{\beta} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q$$

when $q \ge 1$.

▶ For $q = 2$ known as ridge regression, for $q = 1$ known as the lasso
▶ Constraints are convex for all $q \ge 1$ but not differentiable at $\beta = 0$ for $q = 1$
Intuition for the penalties (I)
Assume the OLS solution $\hat{\beta}_{OLS}$ exists and set $r = y - X\hat{\beta}_{OLS}$. For the residual sum of squares (RSS) it follows that

$$\|y - X\beta\|_2^2 = \|(X\hat{\beta}_{OLS} + r) - X\beta\|_2^2 = \|X(\beta - \hat{\beta}_{OLS}) - r\|_2^2 = (\beta - \hat{\beta}_{OLS})^T X^T X (\beta - \hat{\beta}_{OLS}) - 2 r^T X (\beta - \hat{\beta}_{OLS}) + r^T r$$

Since the OLS residual $r$ is orthogonal to the columns of $X$, the middle term vanishes, so the RSS contours form an ellipse (at least in 2D) centred on $\hat{\beta}_{OLS}$.
Intuition for the penalties (II)
The least-squares RSS is minimized at $\hat{\beta}_{OLS}$. If a constraint ($\|\beta\|_q^q \le c$) is added, then the RSS is minimized by the closest feasible $\beta$ that fulfils the constraint.

[Figure: RSS contours around $\hat{\beta}_{OLS}$ in the $(\beta_1, \beta_2)$ plane with the lasso (diamond) and ridge (disc) constraint regions; $\hat{\beta}_{lasso}$ and $\hat{\beta}_{ridge}$ mark the constrained solutions. The blue lines are the contour lines of the RSS.]
Intuition for the penalties (III)
Depending on $q$, the different constraints lead to different solutions. If $\hat{\beta}_{OLS}$ is in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.

Sparsity only for $q \le 1$. Convexity only for $q \ge 1$.

[Figure: constraint regions $\|\beta\|_q^q \le c$ in the $(\beta_1, \beta_2)$ plane for $q = 2$, $q = \infty$, $q = 0.7$ and $q = 1$.]
Shrinkage and effective degrees of freedom
When $\lambda$ is fixed, the shrinkage of the lasso estimate $\hat{\beta}_{lasso}(\lambda)$ compared to the OLS estimate $\hat{\beta}_{OLS}$ is defined as

$$s(\lambda) = \frac{\|\hat{\beta}_{lasso}(\lambda)\|_1}{\|\hat{\beta}_{OLS}\|_1}$$

Note: $s(\lambda) \in [0, 1]$ with $s(\lambda) \to 0$ for increasing $\lambda$ and $s(\lambda) = 1$ for $\lambda = 0$.

For ridge regression define $H(\lambda) := X(X^T X + \lambda I_p)^{-1} X^T$ and

$$\mathrm{df}(\lambda) := \mathrm{tr}(H(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$$

the effective degrees of freedom, where the $d_j$ are the singular values of $X$.
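A minimal numpy sketch (not from the lecture) that checks the effective-degrees-of-freedom formula numerically; the random design matrix, the value of lam and the sizes n, p (chosen to mimic the prostate example) are placeholder assumptions.

```python
import numpy as np

# Effective degrees of freedom for ridge regression:
# df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), checked against tr(H(lambda)).
rng = np.random.default_rng(0)
n, p = 67, 8
X = rng.standard_normal((n, p))
lam = 10.0

d = np.linalg.svd(X, compute_uv=False)                    # singular values d_j of X
df_svd = np.sum(d**2 / (d**2 + lam))

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge hat matrix H(lambda)
df_trace = np.trace(H)

print(df_svd, df_trace)                                   # the two values agree
```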
A regularisation path
Prostate cancer dataset ($n = 67$, $p = 8$)

Red dashed lines indicate the $\lambda$ selected by cross-validation.

[Figure: coefficient paths for ridge regression (plotted against the effective degrees of freedom and against $\log(\lambda)$) and for the lasso (plotted against the shrinkage $s(\lambda)$ and against $\log(\lambda)$).]
Connection to classification
Recall: Regularised Discriminant Analysis (RDA)
Given training samples $(c_i, x_i)$, quadratic discriminant analysis models

$$p(x \mid c = k) = N(x \mid \mu_k, \Sigma_k) \quad \text{and} \quad p(c = k) = \pi_k$$

Estimates $\hat{\mu}_k$, $\hat{\Sigma}_k$ and $\hat{\pi}_k$ are straightforward to find, but evaluating the normal density requires inversion of $\hat{\Sigma}_k$. If it is (near-)singular, this can lead to numerical instability. Penalisation can help here:

▶ Use $\hat{\Sigma}_k = \hat{\Sigma}_k^{QDA} + \lambda \hat{\Sigma}^{LDA}$ for $\lambda > 0$
▶ Use LDA (i.e. $\Sigma_k = \Sigma$) and $\hat{\Sigma} = \hat{\Sigma}^{LDA} + \lambda D$ for $\lambda > 0$ and a diagonal matrix $D$
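A tiny numpy sketch of the two regularisation options above (illustrative only; the names Sigma_qda_k, Sigma_lda, D and lam are placeholders, not from the lecture).

```python
import numpy as np

def regularise_qda(Sigma_qda_k, Sigma_lda, lam):
    # Option 1: shrink a class-specific QDA covariance towards the pooled LDA covariance
    return Sigma_qda_k + lam * Sigma_lda

def regularise_lda(Sigma_lda, lam, D=None):
    # Option 2: add a (by default identity) diagonal matrix to the pooled LDA covariance
    if D is None:
        D = np.eye(Sigma_lda.shape[0])
    return Sigma_lda + lam * D
```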
Recall: Naive Bayes LDA
Naive Bayes LDA means that we assume $\hat{\Sigma} = \hat{D}$ for a diagonal matrix $\hat{D}$. The diagonal elements are estimated as

$$\hat{d}_{jj}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{c_i = k} (x_{ij} - \hat{\mu}_{k,j})^2,$$

which is the pooled within-class variance. Classification is performed by evaluating the discriminant functions

$$\delta_k(x) = -\tfrac{1}{2}(x - \hat{\mu}_k)^T \hat{D}^{-1} (x - \hat{\mu}_k) + \log(\hat{\pi}_k)$$

and by choosing $\hat{c}(x) = \arg\max_k \delta_k(x)$ as the predicted class.
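A compact numpy sketch of this classifier under the assumptions above (pooled diagonal covariance, priors from class frequencies); the function names and the assumption that classes are labelled 0, ..., K-1 are mine, not from the lecture.

```python
import numpy as np

def fit_nb_lda(X, c, K):
    """Fit naive Bayes LDA: class centroids, priors and pooled diagonal variances."""
    n, p = X.shape
    mu = np.array([X[c == k].mean(axis=0) for k in range(K)])   # class centroids
    pi = np.array([np.mean(c == k) for k in range(K)])          # class priors
    # pooled within-class variance per feature: 1/(n - K) * sum of squared deviations
    ss = sum(((X[c == k] - mu[k])**2).sum(axis=0) for k in range(K))
    d2 = ss / (n - K)
    return mu, pi, d2

def predict_nb_lda(X, mu, pi, d2):
    """Assign each row of X to the class with the largest discriminant delta_k."""
    # delta_k(x) = -0.5 * sum_j (x_j - mu_kj)^2 / d_jj^2 + log(pi_k)
    delta = np.stack([-0.5 * ((X - mu[k])**2 / d2).sum(axis=1) + np.log(pi[k])
                      for k in range(len(pi))], axis=1)
    return delta.argmax(axis=1)
```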
Shrunken centroids (I)
In high-dimensional problems, centroids will
▶ contain noise
▶ be hard to interpret when all variables are active

As in regression, we would like to perform variable selection and reduce noise.

Note: The class centroids solve

$$\hat{\mu}_k = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|x_i - m\|_2^2$$

Nearest shrunken centroids performs variable selection and stabilises the centroid estimates by solving

$$\hat{\mu}_k^s = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|(\hat{D} + s_0 I_p)^{-1/2} (x_i - m)\|_2^2 + \lambda \sqrt{\frac{(n - n_k) n_k}{n}} \|m - \bar{\mu}\|_1$$
Shrunken centroids (II)
Nearest shrunken centroids

$$\hat{\mu}_k^s = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|(\hat{D} + s_0 I_p)^{-1/2} (x_i - m)\|_2^2 + \lambda \sqrt{\frac{(n - n_k) n_k}{n}} \|m - \bar{\mu}\|_1$$

▶ Penalises the distance of the class centroid to the overall centroid $\bar{\mu}$
▶ $\hat{D} + s_0 I_p$ is the regularised diagonal within-class covariance matrix. This gives greater weight to variables that are less variable across samples (interpretability)
▶ $\sqrt{(n - n_k) n_k / n}$ is only there for technical reasons
▶ If the predictors are centred ($\bar{\mu} = 0$) this is a scaled lasso problem
Shrunken centroids (III)
The solution for component $j$ can be derived using subdifferentials as

$$\hat{\mu}_{k,j}^s = \bar{\mu}_j + m_k (\hat{d}_{jj} + s_0) \, \mathrm{ST}(u_{k,j}, \lambda) \quad \text{where} \quad u_{k,j} = \frac{\hat{\mu}_{k,j} - \bar{\mu}_j}{m_k (\hat{d}_{jj} + s_0)} \quad \text{and} \quad m_k = \sqrt{\frac{1}{n_k} - \frac{1}{n}}.$$

Note: $\lambda$ is a tuning parameter and has to be determined through e.g. cross-validation.

▶ Typically, the misclassification rate improves first with increasing $\lambda$ and declines for too high values
▶ The larger $\lambda$, the more components will be equal to the respective component of the overall centroid
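A small numpy sketch of this component-wise update (my own illustration; mu_k and mu_bar are the class and overall centroids, d the vector of pooled within-class standard deviations, s0 the offset, all placeholder names).

```python
import numpy as np

def shrunken_centroid(mu_k, mu_bar, d, s0, n_k, n, lam):
    """Shrink the centroid of class k componentwise towards the overall centroid."""
    m_k = np.sqrt(1.0 / n_k - 1.0 / n)
    u = (mu_k - mu_bar) / (m_k * (d + s0))                 # standardised distances u_{k,j}
    st = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)     # soft-thresholding ST(u, lam)
    return mu_bar + m_k * (d + s0) * st
```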
Application of nearest shrunken centroids (I)
A gene expression data set with $n = 63$ and $p = 2308$. There are four classes (cancer subtypes) with $n_{BL} = 8$, $n_{EWS} = 23$, $n_{NB} = 12$ and $n_{RMS} = 20$.

[Figure: 5-fold cross-validation curve of the misclassification rate against $\lambda$, and the largest $\lambda$ that leads to the minimal misclassification rate.]
Application of nearest shrunken centroids (II)
[Figure: average expression per gene for the four classes BL, EWS, NB and RMS. Grey lines show the original centroids and red lines show the shrunken centroids.]
General calculation of the lasso estimates
Calculation of the lasso estimate
Last lecture: When $X^T X = I_p$ and $\hat{\beta}_{OLS}$ is the OLS estimate, then

$$\hat{\beta}_{lasso,j}(\lambda) = \mathrm{sign}(\hat{\beta}_{OLS,j})(|\hat{\beta}_{OLS,j}| - \lambda)_+ = \mathrm{ST}(\hat{\beta}_{OLS,j}, \lambda)$$

where $(a)_+ = \max(a, 0)$ and $\mathrm{ST}$ is the soft-thresholding operator.

What about the general case?

Coordinate descent: The lasso problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

can be written in coordinates (omitting terms not dependent on any $\beta_j$) as

$$\arg\min_{\beta_1, \dots, \beta_p} \tfrac{1}{2} \sum_{j,k=1}^{p} x_j^T x_k \beta_j \beta_k - \sum_{i=1}^{n} \sum_{j=1}^{p} y_i x_{ij} \beta_j + \lambda \sum_{j=1}^{p} |\beta_j|$$
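A brief numpy illustration (my own, not from the slides) of the soft-thresholding operator and of the closed-form lasso solution in the orthonormal case $X^T X = I_p$; the simulated data and the value lam=0.3 are arbitrary placeholders.

```python
import numpy as np

def soft_threshold(a, lam):
    # ST(a, lam) = sign(a) * (|a| - lam)_+
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

# Orthonormal design: X^T X = I_p, so beta_OLS = X^T y and the lasso solution
# is obtained by soft-thresholding each OLS coefficient.
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((50, 5)))   # columns of X are orthonormal
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

beta_ols = X.T @ y
beta_lasso = soft_threshold(beta_ols, 0.3)
print(beta_ols, beta_lasso)
```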
Subderivative and subdifferential
Let $g : I \to \mathbb{R}$ be a convex function on an open interval $I$ and $x_0 \in I$. A $c \in \mathbb{R}$ is called a subderivative of $g$ at $x_0$ if

$$g(x) - g(x_0) \ge c(x - x_0)$$

It can be shown that for

$$a = \lim_{x \to x_0^-} \frac{g(x) - g(x_0)}{x - x_0} \quad \text{and} \quad b = \lim_{x \to x_0^+} \frac{g(x) - g(x_0)}{x - x_0}$$

all $c \in [a, b]$ are subderivatives. Call $\partial g(x_0) := [a, b]$ the subdifferential of $g$ at $x_0$.

Example: Let $g(x) = |x|$, then

$$\partial g(x_0) = \begin{cases} \{-1\} & x_0 < 0 \\ [-1, 1] & x_0 = 0 \\ \{+1\} & x_0 > 0 \end{cases}$$
Properties of subdifferentials
1. A convex function is differentiable at $x_0$ if and only if its subdifferential at $x_0$ contains only one point. This point is the derivative.
2. Moreau-Rockafellar theorem: If $g, h$ are convex with subdifferentials $\partial g$ and $\partial h$, then $\partial(g + h) = \partial g + \partial h$, where $\partial g + \partial h = \{s_1 + s_2 : s_1 \in \partial g, s_2 \in \partial h\}$.
3. Stationarity condition: A point $x_0$ is a global minimum of a convex function if and only if $0$ is contained in the subdifferential at $x_0$.
Coordinate Descent (I)
Idea: Use subdifferentials to find the global minimum $\hat{\beta}_l$ for a single coefficient $\beta_l$ given the other coefficients $\beta_{-l}$, i.e.

$$\hat{\beta}_l = \arg\min_{\beta_l} f(\beta_l) = \arg\min_{\beta_l} \tfrac{1}{2} \sum_{j,k=1}^{p} x_j^T x_k \beta_j \beta_k - \sum_{i=1}^{n} \sum_{j=1}^{p} y_i x_{ij} \beta_j + \lambda \sum_{j=1}^{p} |\beta_j|$$

Taking the subdifferential of $f$ at $\beta_l$ leads to (with the partial residual $r_l := y - \sum_{j \ne l} \beta_j x_j$)

$$\partial f(\beta_l) = -x_l^T (r_l - \beta_l x_l) + \lambda \begin{cases} \{-1\} & \beta_l < 0 \\ [-1, 1] & \beta_l = 0 \\ \{+1\} & \beta_l > 0 \end{cases} = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$
Coordinate Descent (II)
Subdifferential of $f$ at $\beta_l$:

$$\partial f(\beta_l) = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$

Two cases:

1. Standard derivative for $\beta_l \ne 0$, i.e.

$$\frac{\partial f}{\partial \beta_l} = 0 \iff \beta_l = \begin{cases} \dfrac{x_l^T r_l + \lambda}{x_l^T x_l} & x_l^T r_l < -\lambda \\[1ex] \dfrac{x_l^T r_l - \lambda}{x_l^T x_l} & x_l^T r_l > +\lambda \end{cases}$$
Coordinate Descent (III)
Subdifferential of $f$ at $\beta_l$:

$$\partial f(\beta_l) = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$

Two cases:

2. Stationarity condition for $\beta_l = 0$, i.e.

$$0 \in \partial f(0) \iff -\lambda \le x_l^T r_l \le \lambda$$
Coordinate Descent (IV)
In total we get the solution

$$\hat{\beta}_l(\beta_{-l}) = \begin{cases} \dfrac{x_l^T r_l + \lambda}{x_l^T x_l} & x_l^T r_l < -\lambda \\[1ex] 0 & -\lambda \le x_l^T r_l \le \lambda \\[1ex] \dfrac{x_l^T r_l - \lambda}{x_l^T x_l} & x_l^T r_l > \lambda \end{cases} \;=\; \frac{\mathrm{ST}(x_l^T r_l, \lambda)}{x_l^T x_l},$$

the unique minimizer when all coefficients but $\beta_l$ are fixed.

Multiple options for the updating order:
▶ Cyclic coordinate descent: Update one coordinate at a time in a fixed order. Once every coordinate has been updated, start over.
▶ Choose the coordinate that leads to the best decrease in the total target function value.

Note: Coordinate descent is not guaranteed to converge if the target function's level curves are not smooth; for the lasso it still works because the non-differentiable penalty is separable across coordinates. A sketch of cyclic coordinate descent is given below.
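A compact numpy sketch of cyclic coordinate descent for the lasso (illustrative only; no convergence checks, column standardisation or warm starts as a production implementation such as glmnet would use, and the simulated data are placeholders).

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)                       # x_l^T x_l for every column
    for _ in range(n_iter):
        for l in range(p):
            # partial residual r_l = y - sum_{j != l} beta_j x_j
            r_l = y - X @ beta + X[:, l] * beta[l]
            beta[l] = soft_threshold(X[:, l] @ r_l, lam) / col_sq[l]
    return beta

# Tiny usage example with simulated data
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
beta_true = np.concatenate([np.array([3.0, -2.0]), np.zeros(8)])
y = X @ beta_true + rng.standard_normal(100)
print(lasso_cd(X, y, lam=20.0))
```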
Another algorithmic approach
▶ Cyclic coordinate descent is a popular approach (e.g. the R package glmnet) but is hard to parallelise due to its sequential updates
▶ The Alternating Direction Method of Multipliers (ADMM) reformulates the lasso problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

as

$$\arg\min_{\beta, z} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|z\|_1 \quad \text{such that } z = \beta$$

and (approximately) minimizes the augmented Lagrangian

$$\arg\min_{\beta, z} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|z\|_1 + u^T(\beta - z) + \frac{\rho}{2}\|\beta - z\|_2^2$$

It iteratively solves for $\beta$ and $z$, then updates the dual variable $u$. A sketch of these updates is given below.
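A minimal numpy sketch of these updates, written in the commonly used "scaled dual" form of ADMM (so the variable u below corresponds to the multiplier above divided by rho); the fixed penalty rho, the iteration count and the simulated data are placeholder assumptions.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for 0.5*||y - X beta||^2 + lam*||z||_1 subject to z = beta."""
    n, p = X.shape
    beta, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    A = X.T @ X + rho * np.eye(p)        # fixed across iterations
    Xty = X.T @ y
    for _ in range(n_iter):
        beta = np.linalg.solve(A, Xty + rho * (z - u))   # beta-update (ridge-like system)
        z = soft_threshold(beta + u, lam / rho)          # z-update (prox of the l1 term)
        u = u + beta - z                                 # dual update
    return z

# Placeholder usage on simulated data
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
y = X @ np.concatenate([np.array([3.0, -2.0]), np.zeros(8)]) + rng.standard_normal(100)
print(lasso_admm(X, y, lam=20.0))
```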
Extensions of the lasso
The lasso and groups of highly correlated variables
▶ The lasso does not handle groups of highly correlated variables well.
▶ Example: Two groups of highly correlated variables, e.g. the rows of $X$ drawn from $N(0, \Sigma)$ where

$$\Sigma = \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_1 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 1.04 & 1 & 1 \\ 1 & 1.04 & 1 \\ 1 & 1 & 1.04 \end{pmatrix}$$

and $y = 3 x_1 - 1.5 x_5 + \varepsilon$ where $\varepsilon \sim N(0, 4 I_n)$.
▶ The (very theoretical) irrepresentable condition from the last lecture tells us that the lasso will not be able to recover the true model.
▶ What happens in practice? A simulation sketch of this setup is given below.
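A numpy sketch (with an arbitrary sample size and seed) that simulates this correlated-groups design; fitting the lasso or elastic net to it would be done with a library such as glmnet or scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Block covariance: two groups of three highly correlated variables
Sigma1 = np.full((3, 3), 1.0) + 0.04 * np.eye(3)       # 1.04 on the diagonal, 1 off-diagonal
Sigma = np.block([[Sigma1, np.zeros((3, 3))],
                  [np.zeros((3, 3)), Sigma1]])

X = rng.multivariate_normal(np.zeros(6), Sigma, size=n)
eps = rng.normal(scale=2.0, size=n)                    # variance 4
y = 3.0 * X[:, 0] - 1.5 * X[:, 4] + eps                # x_1 and x_5 (0-based columns 0 and 4)

print(np.corrcoef(X, rowvar=False).round(2))           # within-group correlations about 0.96
```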
The lasso and groups of highly correlated variables in practice
[Figure: lasso and elastic net ($\alpha = 0.3$) coefficient paths plotted against the shrinkage factor for the simulated data above.]

▶ The lasso typically selects one variable from a group of highly correlated variables, more or less at random, instead of distributing the coefficients evenly.
▶ The elastic net is an extension of the lasso which "finds" correlated groups of variables.
▶ Note that the elastic net is not given explicit knowledge of the groups of variables.
The elastic net (I)
The elastic net solves the problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \left( \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right)$$

striking a balance between the lasso (variable selection) and ridge regression (grouping of variables).
The elastic net (II)
The solution can be found through cyclic coordinate descent, and the coefficient updates are

$$\hat{\beta}_l(\beta_{-l}) = \frac{\mathrm{ST}(x_l^T r_l, \lambda\alpha)}{x_l^T x_l + \lambda(1 - \alpha)}$$

▶ Note: $\alpha$ is an additional tuning parameter that should be determined by cross-validation
▶ The lasso and ridge regression are special cases of the elastic net ($\alpha = 1$ and $\alpha = 0$, respectively)

The R package glmnet is a popular implementation of the elastic net. A sketch of the coordinate update is given below.
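A numpy sketch of cyclic coordinate descent with this elastic-net update (my own illustration, mirroring the lasso sketch earlier with the modified threshold and denominator; no convergence checks).

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*((1-alpha)/2*||beta||_2^2 + alpha*||beta||_1)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for l in range(p):
            r_l = y - X @ beta + X[:, l] * beta[l]     # partial residual for coordinate l
            beta[l] = (soft_threshold(X[:, l] @ r_l, lam * alpha)
                       / (col_sq[l] + lam * (1 - alpha)))
    return beta
```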
The lasso and groups of variables
▶ The lasso in its original formulation considers each variable separately
▶ Groups in the data can form through e.g.
  ▶ Correlation
  ▶ Categorical variables in dummy encoding
  ▶ Domain knowledge (e.g. genes in the same signalling pathway, signals that only appear in groups in a compressed-sensing problem, …)
▶ Ideally the whole group is either present in the model or not
▶ The elastic net can find groups, but only does so for highly correlated variables and without external influence. Sometimes more control is necessary.
The group lasso (I)
The group lasso solves the problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{l=1}^{L} \|\beta_l\|_2$$

where $\beta_l$ is the vector of coefficients $\beta_j$ belonging to the $l$-th group. Note that $\|\beta_j\|_2 = |\beta_j|$ for singleton groups.
The group lasso (II)
The solution can be derived similarly as for the lasso or the elastic net, but on the group level. This leads to the coefficient update

$$\beta_l^{(t+1)} = \left( X_l^T X_l + \frac{\lambda}{\|\beta_l^{(t)}\|_2} I \right)^{-1} X_l^T r_l$$

when $\|X_l^T r_l\|_2 > \lambda$, and $\beta_l^{(t+1)} = 0$ otherwise. Here $X_l$ contains all predictors belonging to group $l$ and

$$r_l = y - \sum_{k \ne l} X_k \beta_k$$

A sketch of this block update is given below.
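A numpy sketch of block coordinate descent with this update (my own illustration; `groups` is a list of column-index arrays, and the update is iterated a few times per block as a fixed-point step since it depends on the current $\|\beta_l\|_2$).

```python
import numpy as np

def group_lasso_bcd(X, y, groups, lam, n_iter=100, inner=10):
    """Block coordinate descent for 0.5*||y - X beta||^2 + lam * sum_l ||beta_l||_2.

    groups: list of integer index arrays, one per group of columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for idx in groups:
            Xl = X[:, idx]
            r_l = y - X @ beta + Xl @ beta[idx]          # partial residual for group l
            g = Xl.T @ r_l
            if np.linalg.norm(g) <= lam:                 # whole group set to zero
                beta[idx] = 0.0
                continue
            # start the fixed-point iteration from the current (nonzero) value
            b = beta[idx] if np.linalg.norm(beta[idx]) > 0 else g / np.linalg.norm(g)
            for _ in range(inner):
                A = Xl.T @ Xl + (lam / np.linalg.norm(b)) * np.eye(len(idx))
                b = np.linalg.solve(A, g)
            beta[idx] = b
    return beta
```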
Take-home message
▶ Penalisation methods are not restricted to regression; they are also applicable to classification
▶ Sparsity is a very important concept when interpretability of models is important
▶ Many extensions of the lasso exist, which make it more flexible (e.g. the elastic net and the group lasso)