Lecture 10: Regularized/penalized regression (cont'd)
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
2nd May 2019
A short recap
Goals of modelling
1. Predictive strength: How well can we reconstruct the observed data? Has been most important so far.
2. Model/variable selection: Which variables are part of the true model? This is about uncovering structure to allow for mechanistic understanding.
Feature selection
Feature selection can be addressed in multiple ways
▶ Filtering: Remove variables before the actual model for the data is built
  ▶ Often crude but fast
  ▶ Typically only pays attention to one or two features at a time (e.g. F-score, MIC) or does not take the outcome variable into consideration (e.g. PCA)
▶ Wrapping: Consider the selected features as an additional hyper-parameter
  ▶ Computationally very heavy
  ▶ Most approximations are greedy algorithms
▶ Embedding: Include feature selection into parameter estimation through penalisation of the model coefficients
  ▶ Naive form is equally computationally heavy as wrapping
  ▶ Soft constraints create biased but useful approximations
Penalised regression
The optimization problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_q^q \le c$$

for $q > 0$ is equivalent to

$$\hat{\beta} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q$$

when $q \ge 1$.

▶ For $q = 2$ known as ridge regression, for $q = 1$ known as the lasso
▶ Constraints are convex for all $q \ge 1$ but not differentiable at $\beta = 0$ for $q = 1$
Intuition for the penalties (I)
Assume the OLS solution $\hat{\beta}_{OLS}$ exists and set $r = y - X\hat{\beta}_{OLS}$. For the residual sum of squares (RSS) it follows that

$$\|y - X\beta\|_2^2 = \|(X\hat{\beta}_{OLS} + r) - X\beta\|_2^2 = \|X(\beta - \hat{\beta}_{OLS}) - r\|_2^2 = (\beta - \hat{\beta}_{OLS})^T X^T X (\beta - \hat{\beta}_{OLS}) - 2 r^T X (\beta - \hat{\beta}_{OLS}) + r^T r$$

Since the OLS residual $r$ is orthogonal to the columns of $X$, the middle term vanishes, so the RSS contours form an ellipse (at least in 2D) centred on $\hat{\beta}_{OLS}$.
Intuition for the penalties (II)
The least-squares RSS is minimized at $\hat{\beta}_{OLS}$. If a constraint ($\|\beta\|_q^q \le c$) is added, then the RSS is minimized by the closest feasible $\beta$ that fulfils the constraint.

[Figure: RSS contours around $\hat{\beta}_{OLS}$ in the $(\beta_1, \beta_2)$ plane with the lasso (diamond) and ridge (disc) constraint regions; $\hat{\beta}_{lasso}$ and $\hat{\beta}_{ridge}$ mark the constrained solutions. The blue lines are the contour lines of the RSS.]
Intuition for the penalties (III)
Depending on $q$, the different constraints lead to different solutions. If $\hat{\beta}_{OLS}$ is in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.

Sparsity only for $q \le 1$. Convexity only for $q \ge 1$.

[Figure: constraint regions $\|\beta\|_q^q \le c$ in the $(\beta_1, \beta_2)$ plane for $q = 2$, $q = \infty$, $q = 0.7$ and $q = 1$.]
Shrinkage and effective degrees of freedom
When $\lambda$ is fixed, the shrinkage of the lasso estimate $\hat{\beta}_{lasso}(\lambda)$ compared to the OLS estimate $\hat{\beta}_{OLS}$ is defined as

$$s(\lambda) = \frac{\|\hat{\beta}_{lasso}(\lambda)\|_1}{\|\hat{\beta}_{OLS}\|_1}$$

Note: $s(\lambda) \in [0, 1]$ with $s(\lambda) \to 0$ for increasing $\lambda$ and $s(\lambda) = 1$ for $\lambda = 0$.

For ridge regression define $H(\lambda) := X(X^T X + \lambda I_p)^{-1} X^T$ and

$$\mathrm{df}(\lambda) := \mathrm{tr}(H(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$$

the effective degrees of freedom, where the $d_j$ are the singular values of $X$.
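A minimal numpy sketch (not from the lecture) that checks the effective-degrees-of-freedom formula numerically; the random design matrix, the value of lam and the sizes n, p (chosen to mimic the prostate example) are placeholder assumptions.

```python
import numpy as np

# Effective degrees of freedom for ridge regression:
# df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), checked against tr(H(lambda)).
rng = np.random.default_rng(0)
n, p = 67, 8
X = rng.standard_normal((n, p))
lam = 10.0

d = np.linalg.svd(X, compute_uv=False)                    # singular values d_j of X
df_svd = np.sum(d**2 / (d**2 + lam))

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge hat matrix H(lambda)
df_trace = np.trace(H)

print(df_svd, df_trace)                                   # the two values agree
```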
A regularisation path
Prostate cancer dataset ($n = 67$, $p = 8$)

Red dashed lines indicate the $\lambda$ selected by cross-validation.

[Figure: coefficient paths for ridge regression (plotted against the effective degrees of freedom and against $\log(\lambda)$) and for the lasso (plotted against the shrinkage $s(\lambda)$ and against $\log(\lambda)$).]
Connection to classification
Recall: Regularised Discriminant Analysis (RDA)
Given training samples $(c_i, x_i)$, quadratic discriminant analysis models

$$p(x \mid c = k) = N(x \mid \mu_k, \Sigma_k) \quad \text{and} \quad p(c = k) = \pi_k$$

Estimates $\hat{\mu}_k$, $\hat{\Sigma}_k$ and $\hat{\pi}_k$ are straightforward to find, but evaluating the normal density requires inversion of $\hat{\Sigma}_k$. If it is (near-)singular, this can lead to numerical instability. Penalisation can help here:

▶ Use $\hat{\Sigma}_k = \hat{\Sigma}_k^{QDA} + \lambda \hat{\Sigma}^{LDA}$ for $\lambda > 0$
▶ Use LDA (i.e. $\Sigma_k = \Sigma$) and $\hat{\Sigma} = \hat{\Sigma}^{LDA} + \lambda D$ for $\lambda > 0$ and a diagonal matrix $D$
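A tiny numpy sketch of the two regularisation options above (illustrative only; the names Sigma_qda_k, Sigma_lda, D and lam are placeholders, not from the lecture).

```python
import numpy as np

def regularise_qda(Sigma_qda_k, Sigma_lda, lam):
    # Option 1: shrink a class-specific QDA covariance towards the pooled LDA covariance
    return Sigma_qda_k + lam * Sigma_lda

def regularise_lda(Sigma_lda, lam, D=None):
    # Option 2: add a (by default identity) diagonal matrix to the pooled LDA covariance
    if D is None:
        D = np.eye(Sigma_lda.shape[0])
    return Sigma_lda + lam * D
```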
Recall: Naive Bayes LDA
Naive Bayes LDA means that we assume $\hat{\Sigma} = \hat{D}$ for a diagonal matrix $\hat{D}$. The diagonal elements are estimated as

$$\hat{d}_{jj}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{c_i = k} (x_{ij} - \hat{\mu}_{k,j})^2,$$

which is the pooled within-class variance. Classification is performed by evaluating the discriminant functions

$$\delta_k(x) = -\tfrac{1}{2}(x - \hat{\mu}_k)^T \hat{D}^{-1} (x - \hat{\mu}_k) + \log(\hat{\pi}_k)$$

and by choosing $\hat{c}(x) = \arg\max_k \delta_k(x)$ as the predicted class.
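A compact numpy sketch of this classifier under the assumptions above (pooled diagonal covariance, priors from class frequencies); the function names and the assumption that classes are labelled 0, ..., K-1 are mine, not from the lecture.

```python
import numpy as np

def fit_nb_lda(X, c, K):
    """Fit naive Bayes LDA: class centroids, priors and pooled diagonal variances."""
    n, p = X.shape
    mu = np.array([X[c == k].mean(axis=0) for k in range(K)])   # class centroids
    pi = np.array([np.mean(c == k) for k in range(K)])          # class priors
    # pooled within-class variance per feature: 1/(n - K) * sum of squared deviations
    ss = sum(((X[c == k] - mu[k])**2).sum(axis=0) for k in range(K))
    d2 = ss / (n - K)
    return mu, pi, d2

def predict_nb_lda(X, mu, pi, d2):
    """Assign each row of X to the class with the largest discriminant delta_k."""
    # delta_k(x) = -0.5 * sum_j (x_j - mu_kj)^2 / d_jj^2 + log(pi_k)
    delta = np.stack([-0.5 * ((X - mu[k])**2 / d2).sum(axis=1) + np.log(pi[k])
                      for k in range(len(pi))], axis=1)
    return delta.argmax(axis=1)
```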
Shrunken centroids (I)
In high-dimensional problems, centroids will
▶ contain noise
▶ be hard to interpret when all variables are active

As in regression, we would like to perform variable selection and reduce noise.

Note: The class centroids solve

$$\hat{\mu}_k = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|x_i - m\|_2^2$$

Nearest shrunken centroids performs variable selection and stabilises the centroid estimates by solving

$$\hat{\mu}_k^s = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|(\hat{D} + s_0 I_p)^{-1/2} (x_i - m)\|_2^2 + \lambda \sqrt{\frac{(n - n_k) n_k}{n}} \|m - \bar{\mu}\|_1$$
Shrunken centroids (II)
Nearest shrunken centroids

$$\hat{\mu}_k^s = \arg\min_{m} \tfrac{1}{2} \sum_{c_i = k} \|(\hat{D} + s_0 I_p)^{-1/2} (x_i - m)\|_2^2 + \lambda \sqrt{\frac{(n - n_k) n_k}{n}} \|m - \bar{\mu}\|_1$$

▶ Penalises the distance of the class centroid to the overall centroid $\bar{\mu}$
▶ $\hat{D} + s_0 I_p$ is the regularised diagonal within-class covariance matrix. This gives greater weight to variables that are less variable across samples (interpretability)
▶ $\sqrt{(n - n_k) n_k / n}$ is only there for technical reasons
▶ If the predictors are centred ($\bar{\mu} = 0$) this is a scaled lasso problem
Shrunken centroids (III)
The solution for component $j$ can be derived using subdifferentials as

$$\hat{\mu}_{k,j}^s = \bar{\mu}_j + m_k (\hat{d}_{jj} + s_0) \, \mathrm{ST}(u_{k,j}, \lambda) \quad \text{where} \quad u_{k,j} = \frac{\hat{\mu}_{k,j} - \bar{\mu}_j}{m_k (\hat{d}_{jj} + s_0)} \quad \text{and} \quad m_k = \sqrt{\frac{1}{n_k} - \frac{1}{n}}.$$

Note: $\lambda$ is a tuning parameter and has to be determined through e.g. cross-validation.

▶ Typically, the misclassification rate improves first with increasing $\lambda$ and declines for too high values
▶ The larger $\lambda$, the more components will be equal to the respective component of the overall centroid
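A small numpy sketch of this component-wise update (my own illustration; mu_k and mu_bar are the class and overall centroids, d the vector of pooled within-class standard deviations, s0 the offset, all placeholder names).

```python
import numpy as np

def shrunken_centroid(mu_k, mu_bar, d, s0, n_k, n, lam):
    """Shrink the centroid of class k componentwise towards the overall centroid."""
    m_k = np.sqrt(1.0 / n_k - 1.0 / n)
    u = (mu_k - mu_bar) / (m_k * (d + s0))                 # standardised distances u_{k,j}
    st = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)     # soft-thresholding ST(u, lam)
    return mu_bar + m_k * (d + s0) * st
```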
Application of nearest shrunken centroids (I)
A gene expression data set with $n = 63$ and $p = 2308$. There are four classes (cancer subtypes) with $n_{BL} = 8$, $n_{EWS} = 23$, $n_{NB} = 12$ and $n_{RMS} = 20$.

[Figure: 5-fold cross-validation curve of the misclassification rate against $\lambda$, and the largest $\lambda$ that leads to the minimal misclassification rate.]
Application of nearest shrunken centroids (II)
[Figure: average expression per gene for the four classes BL, EWS, NB and RMS. Grey lines show the original centroids and red lines show the shrunken centroids.]
General calculation of the lasso estimates
Calculation of the lasso estimate
Last lecture: When $X^T X = I_p$ and $\hat{\beta}_{OLS}$ is the OLS estimate, then

$$\hat{\beta}_{lasso,j}(\lambda) = \mathrm{sign}(\hat{\beta}_{OLS,j})(|\hat{\beta}_{OLS,j}| - \lambda)_+ = \mathrm{ST}(\hat{\beta}_{OLS,j}, \lambda)$$

where $(a)_+ = \max(a, 0)$ and $\mathrm{ST}$ is the soft-thresholding operator.

What about the general case?

Coordinate descent: The lasso problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

can be written in coordinates (omitting terms not dependent on any $\beta_j$) as

$$\arg\min_{\beta_1, \dots, \beta_p} \tfrac{1}{2} \sum_{j,k=1}^{p} x_j^T x_k \beta_j \beta_k - \sum_{i=1}^{n} \sum_{j=1}^{p} y_i x_{ij} \beta_j + \lambda \sum_{j=1}^{p} |\beta_j|$$
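A brief numpy illustration (my own, not from the slides) of the soft-thresholding operator and of the closed-form lasso solution in the orthonormal case $X^T X = I_p$; the simulated data and the value lam=0.3 are arbitrary placeholders.

```python
import numpy as np

def soft_threshold(a, lam):
    # ST(a, lam) = sign(a) * (|a| - lam)_+
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

# Orthonormal design: X^T X = I_p, so beta_OLS = X^T y and the lasso solution
# is obtained by soft-thresholding each OLS coefficient.
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((50, 5)))   # columns of X are orthonormal
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

beta_ols = X.T @ y
beta_lasso = soft_threshold(beta_ols, 0.3)
print(beta_ols, beta_lasso)
```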
Subderivative and subdifferential
Let $g : I \to \mathbb{R}$ be a convex function on an open interval $I$ and $x_0 \in I$. A $c \in \mathbb{R}$ is called a subderivative of $g$ at $x_0$ if

$$g(x) - g(x_0) \ge c(x - x_0)$$

It can be shown that for

$$a = \lim_{x \to x_0^-} \frac{g(x) - g(x_0)}{x - x_0} \quad \text{and} \quad b = \lim_{x \to x_0^+} \frac{g(x) - g(x_0)}{x - x_0}$$

all $c \in [a, b]$ are subderivatives. Call $\partial g(x_0) := [a, b]$ the subdifferential of $g$ at $x_0$.

Example: Let $g(x) = |x|$, then

$$\partial g(x_0) = \begin{cases} \{-1\} & x_0 < 0 \\ [-1, 1] & x_0 = 0 \\ \{+1\} & x_0 > 0 \end{cases}$$
Properties of subdifferentials
1. A convex function is differentiable at $x_0$ if and only if its subdifferential at $x_0$ contains only one point. This point is the derivative.
2. Moreau-Rockafellar theorem: If $g, h$ are convex with subdifferentials $\partial g$ and $\partial h$, then $\partial(g + h) = \partial g + \partial h$, where $\partial g + \partial h = \{s_1 + s_2 : s_1 \in \partial g, s_2 \in \partial h\}$.
3. Stationarity condition: A point $x_0$ is a global minimum of a convex function if and only if $0$ is contained in the subdifferential at $x_0$.
Coordinate Descent (I)
Idea: Use subdifferentials to find the global minimum $\hat{\beta}_l$ for a single coefficient $\beta_l$ given the other coefficients $\beta_{-l}$, i.e.

$$\hat{\beta}_l = \arg\min_{\beta_l} f(\beta_l) = \arg\min_{\beta_l} \tfrac{1}{2} \sum_{j,k=1}^{p} x_j^T x_k \beta_j \beta_k - \sum_{i=1}^{n} \sum_{j=1}^{p} y_i x_{ij} \beta_j + \lambda \sum_{j=1}^{p} |\beta_j|$$

Taking the subdifferential of $f$ at $\beta_l$ leads to (with the partial residual $r_l := y - \sum_{j \ne l} \beta_j x_j$)

$$\partial f(\beta_l) = -x_l^T (r_l - \beta_l x_l) + \lambda \begin{cases} \{-1\} & \beta_l < 0 \\ [-1, 1] & \beta_l = 0 \\ \{+1\} & \beta_l > 0 \end{cases} = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$
Coordinate Descent (II)
Subdifferential of $f$ at $\beta_l$:

$$\partial f(\beta_l) = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$

Two cases:

1. Standard derivative for $\beta_l \ne 0$, i.e.

$$\frac{\partial f}{\partial \beta_l} = 0 \iff \beta_l = \begin{cases} \dfrac{x_l^T r_l + \lambda}{x_l^T x_l} & x_l^T r_l < -\lambda \\[1ex] \dfrac{x_l^T r_l - \lambda}{x_l^T x_l} & x_l^T r_l > +\lambda \end{cases}$$
Coordinate Descent (III)
Subdifferential of $f$ at $\beta_l$:

$$\partial f(\beta_l) = \begin{cases} \{-x_l^T r_l + \beta_l x_l^T x_l - \lambda\} & \beta_l < 0 \\ [-x_l^T r_l - \lambda, \; -x_l^T r_l + \lambda] & \beta_l = 0 \\ \{-x_l^T r_l + \beta_l x_l^T x_l + \lambda\} & \beta_l > 0 \end{cases}$$

Two cases:

2. Stationarity condition for $\beta_l = 0$, i.e.

$$0 \in \partial f(0) \iff -\lambda \le x_l^T r_l \le \lambda$$
Coordinate Descent (IV)
In total we get the solution

$$\hat{\beta}_l(\beta_{-l}) = \begin{cases} \dfrac{x_l^T r_l + \lambda}{x_l^T x_l} & x_l^T r_l < -\lambda \\[1ex] 0 & -\lambda \le x_l^T r_l \le \lambda \\[1ex] \dfrac{x_l^T r_l - \lambda}{x_l^T x_l} & x_l^T r_l > \lambda \end{cases} \;=\; \frac{\mathrm{ST}(x_l^T r_l, \lambda)}{x_l^T x_l},$$

the unique minimizer when all coefficients but $\beta_l$ are fixed.

Multiple options for the updating order:
▶ Cyclic coordinate descent: Update one coordinate at a time in a fixed order. Once every coordinate has been updated, start over.
▶ Choose the coordinate that leads to the best decrease in the total target function value.

Note: Coordinate descent is not guaranteed to converge if the target function's level curves are not smooth; for the lasso it still works because the non-differentiable penalty is separable across coordinates. A sketch of cyclic coordinate descent is given below.
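A compact numpy sketch of cyclic coordinate descent for the lasso (illustrative only; no convergence checks, column standardisation or warm starts as a production implementation such as glmnet would use, and the simulated data are placeholders).

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)                       # x_l^T x_l for every column
    for _ in range(n_iter):
        for l in range(p):
            # partial residual r_l = y - sum_{j != l} beta_j x_j
            r_l = y - X @ beta + X[:, l] * beta[l]
            beta[l] = soft_threshold(X[:, l] @ r_l, lam) / col_sq[l]
    return beta

# Tiny usage example with simulated data
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
beta_true = np.concatenate([np.array([3.0, -2.0]), np.zeros(8)])
y = X @ beta_true + rng.standard_normal(100)
print(lasso_cd(X, y, lam=20.0))
```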
Another algorithmic approach
▶ Cyclic coordinate descent is a popular approach (e.g. the R package glmnet) but is hard to parallelise due to its sequential updates
▶ The Alternating Direction Method of Multipliers (ADMM) reformulates the lasso problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

as

$$\arg\min_{\beta, z} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|z\|_1 \quad \text{such that } z = \beta$$

and (approximately) minimizes the augmented Lagrangian

$$\arg\min_{\beta, z} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|z\|_1 + u^T(\beta - z) + \frac{\rho}{2}\|\beta - z\|_2^2$$

It iteratively solves for $\beta$ and $z$, then updates the dual variable $u$. A sketch of these updates is given below.
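A minimal numpy sketch of these updates, written in the commonly used "scaled dual" form of ADMM (so the variable u below corresponds to the multiplier above divided by rho); the fixed penalty rho, the iteration count and the simulated data are placeholder assumptions.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for 0.5*||y - X beta||^2 + lam*||z||_1 subject to z = beta."""
    n, p = X.shape
    beta, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    A = X.T @ X + rho * np.eye(p)        # fixed across iterations
    Xty = X.T @ y
    for _ in range(n_iter):
        beta = np.linalg.solve(A, Xty + rho * (z - u))   # beta-update (ridge-like system)
        z = soft_threshold(beta + u, lam / rho)          # z-update (prox of the l1 term)
        u = u + beta - z                                 # dual update
    return z

# Placeholder usage on simulated data
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
y = X @ np.concatenate([np.array([3.0, -2.0]), np.zeros(8)]) + rng.standard_normal(100)
print(lasso_admm(X, y, lam=20.0))
```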
Extensions of the lasso
The lasso and groups of highly correlated variables
▶ The lasso does not handle groups of highly correlated variables well.
▶ Example: Two groups of highly correlated variables, e.g. the rows of $X$ drawn from $N(0, \Sigma)$ where

$$\Sigma = \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_1 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 1.04 & 1 & 1 \\ 1 & 1.04 & 1 \\ 1 & 1 & 1.04 \end{pmatrix}$$

and $y = 3 x_1 - 1.5 x_5 + \varepsilon$ where $\varepsilon \sim N(0, 4 I_n)$.
▶ The (very theoretical) irrepresentable condition from the last lecture tells us that the lasso will not be able to recover the true model.
▶ What happens in practice? A simulation sketch of this setup is given below.
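A numpy sketch (with an arbitrary sample size and seed) that simulates this correlated-groups design; fitting the lasso or elastic net to it would be done with a library such as glmnet or scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Block covariance: two groups of three highly correlated variables
Sigma1 = np.full((3, 3), 1.0) + 0.04 * np.eye(3)       # 1.04 on the diagonal, 1 off-diagonal
Sigma = np.block([[Sigma1, np.zeros((3, 3))],
                  [np.zeros((3, 3)), Sigma1]])

X = rng.multivariate_normal(np.zeros(6), Sigma, size=n)
eps = rng.normal(scale=2.0, size=n)                    # variance 4
y = 3.0 * X[:, 0] - 1.5 * X[:, 4] + eps                # x_1 and x_5 (0-based columns 0 and 4)

print(np.corrcoef(X, rowvar=False).round(2))           # within-group correlations about 0.96
```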
The lasso and groups of highly correlated variables in practice
[Figure: lasso and elastic net ($\alpha = 0.3$) coefficient paths plotted against the shrinkage factor for the simulated data above.]

▶ The lasso typically selects one variable from a group of highly correlated variables, more or less at random, instead of distributing the coefficients evenly.
▶ The elastic net is an extension of the lasso which "finds" correlated groups of variables.
▶ Note that the elastic net is not given explicit knowledge of the groups of variables.
The elastic net (I)
The elastic net solves the problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \left( \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right)$$

striking a balance between the lasso (variable selection) and ridge regression (grouping of variables).
The elastic net (II)
The solution can be found through cyclic coordinate descent, and the coefficient updates are

$$\hat{\beta}_l(\beta_{-l}) = \frac{\mathrm{ST}(x_l^T r_l, \lambda\alpha)}{x_l^T x_l + \lambda(1 - \alpha)}$$

▶ Note: $\alpha$ is an additional tuning parameter that should be determined by cross-validation
▶ The lasso and ridge regression are special cases of the elastic net ($\alpha = 1$ and $\alpha = 0$, respectively)

The R package glmnet is a popular implementation of the elastic net. A sketch of the coordinate update is given below.
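A numpy sketch of cyclic coordinate descent with this elastic-net update (my own illustration, mirroring the lasso sketch earlier with the modified threshold and denominator; no convergence checks).

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*((1-alpha)/2*||beta||_2^2 + alpha*||beta||_1)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for l in range(p):
            r_l = y - X @ beta + X[:, l] * beta[l]     # partial residual for coordinate l
            beta[l] = (soft_threshold(X[:, l] @ r_l, lam * alpha)
                       / (col_sq[l] + lam * (1 - alpha)))
    return beta
```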
The lasso and groups of variables
▶ The lasso in its original formulation considers each variable separately
▶ Groups in the data can form through e.g.
  ▶ Correlation
  ▶ Categorical variables in dummy encoding
  ▶ Domain knowledge (e.g. genes in the same signalling pathway, signals that only appear in groups in a compressed-sensing problem, …)
▶ Ideally the whole group is either present in the model or not
▶ The elastic net can find groups, but only does so for highly correlated variables and without external influence. Sometimes more control is necessary.
The group lasso (I)
The group lasso solves the problem

$$\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{l=1}^{L} \|\beta_l\|_2$$

where $\beta_l$ is the vector of coefficients $\beta_j$ belonging to the $l$-th group. Note that $\|\beta_j\|_2 = |\beta_j|$ for singleton groups.
The group lasso (II)
The solution can be derived similarly as for the lasso or the elastic net, but on the group level. This leads to the coefficient update

$$\beta_l^{(t+1)} = \left( X_l^T X_l + \frac{\lambda}{\|\beta_l^{(t)}\|_2} I \right)^{-1} X_l^T r_l$$

when $\|X_l^T r_l\|_2 > \lambda$, and $\beta_l^{(t+1)} = 0$ otherwise. Here $X_l$ contains all predictors belonging to group $l$ and

$$r_l = y - \sum_{k \ne l} X_k \beta_k$$

A sketch of this block update is given below.
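A numpy sketch of block coordinate descent with this update (my own illustration; `groups` is a list of column-index arrays, and the update is iterated a few times per block as a fixed-point step since it depends on the current $\|\beta_l\|_2$).

```python
import numpy as np

def group_lasso_bcd(X, y, groups, lam, n_iter=100, inner=10):
    """Block coordinate descent for 0.5*||y - X beta||^2 + lam * sum_l ||beta_l||_2.

    groups: list of integer index arrays, one per group of columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for idx in groups:
            Xl = X[:, idx]
            r_l = y - X @ beta + Xl @ beta[idx]          # partial residual for group l
            g = Xl.T @ r_l
            if np.linalg.norm(g) <= lam:                 # whole group set to zero
                beta[idx] = 0.0
                continue
            # start the fixed-point iteration from the current (nonzero) value
            b = beta[idx] if np.linalg.norm(beta[idx]) > 0 else g / np.linalg.norm(g)
            for _ in range(inner):
                A = Xl.T @ Xl + (lam / np.linalg.norm(b)) * np.eye(len(idx))
                b = np.linalg.solve(A, g)
            beta[idx] = b
    return beta
```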
Take-home message
▶ Penalisation methods are not restricted to regression; they are also applicable to classification
▶ Sparsity is a very important concept when interpretability of models is important
▶ Many extensions of the lasso exist, which make it more flexible (e.g. the elastic net and the group lasso)