Learning From Data Lecture 12 Regularization
Constraining the Model Weight Decay Augmented Error
- M. Magdon-Ismail
CSCI 4100/6100

recap: Overfitting
Fitting the data more than is warranted.
[Figure: data, target, and an overfit curve.]

recap: Noise is Part of y We Cannot Model
[Figures: y = f(x) + stochastic noise; y = h*(x) + deterministic noise.]
Human: good at extracting the simple pattern, ignoring the noise and complications.
Computer: pays equal attention to all pixels; needs help simplifying (→ features).

What is regularization?
A cure for our tendency to fit (get distracted by) the noise, hence improving Eout.
How? By constraining the model (‘putting on the brakes’) so that we cannot fit the noise.
The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?

Constraining
[Figure: an unconstrained fit to the data.]

Small weights
[Figures: the fit before and after constraining the weights to be smaller.]

bias
[Figures: average hypothesis ḡ(x) vs. the target sin(x), without and with regularization.]
Side effect: the bias gets slightly worse. (The constant model had bias = 0.5 and var = 0.25.)

var
[Figures: ḡ(x) vs. sin(x) with the variance of the fits shown, without and with regularization.]
Side effect: slightly worse bias. Treatment: much smaller var. (The constant model had bias = 0.5 and var = 0.25.)

Regularization in a nutshell
If you use a simpler H and get a good fit, then your Eout is better.

Polynomials
Standard polynomial transform: z = (1, x, x^2, . . . , x^q), so h(x) = wtz(x) = w0 + w1x + · · · + wqx^q.
Legendre polynomial transform: z = (1, L1(x), L2(x), . . . , Lq(x)), so h(x) = wtz(x) = w0 + w1L1(x) + · · · + wqLq(x).
Either way we’re using linear regression; the Legendre basis allows us to treat the weights ‘independently’.

The first few Legendre polynomials:
L1(x) = x
L2(x) = (3x^2 − 1)/2
L3(x) = (5x^3 − 3x)/2
L4(x) = (35x^4 − 30x^2 + 3)/8
L5(x) = (63x^5 − · · · )/8
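As an illustration (my own, not from the slides), NumPy can build both feature transforms; numpy.polynomial.legendre.legvander returns the Legendre feature matrix directly:

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1.0, 1.0, 201)   # Legendre polynomials live naturally on [-1, 1]
q = 5

# Standard polynomial features: columns 1, x, x^2, ..., x^q
Z_std = np.vander(x, q + 1, increasing=True)

# Legendre features: columns 1, L1(x), ..., Lq(x)
Z_leg = legendre.legvander(x, q)

# On a symmetric grid the Legendre columns barely overlap: the Gram matrix
# is close to diagonal, which is what lets us treat the weights 'independently'.
print(np.round(Z_leg.T @ Z_leg, 1))
```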
recap: Linear Regression
Minimize Ein(w) = (1/N)(Zw − y)t(Zw − y); the solution is wlin = (ZtZ)^{−1}Zty.
[Figure: the linear regression fit.]

Already saw constraints
h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
such that: w3 = w4 = · · · = w10 = 0
The hard order constraint sets some weights exactly to zero: fitting H10 under this constraint is just fitting H2 (see the sketch below).
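A minimal sketch of the hard constraint (the data setup is my own): zeroing w3, . . . , w10 is the same as running least squares on just the first three feature columns.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

Z10 = np.vander(x, 11, increasing=True)      # features 1, x, ..., x^10

# Hard order constraint: keep only 1, x, x^2 and force the rest to zero.
w2, *_ = np.linalg.lstsq(Z10[:, :3], y, rcond=None)

w_hard = np.zeros(11)                        # the same hypothesis, viewed in H10
w_hard[:3] = w2
```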

Soft constraint
Σ_{q=0}^{Q} wq^2 ≤ C   (a budget for the weights)
[Diagram: HC sits between H2 (small C) and H10 (C → ∞).]
The soft order constraint allows ‘intermediate’ models.

HC
Hard constraint:
h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
such that: w3 = w4 = · · · = w10 = 0

Soft constraint:
h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
such that: Σ_{q=0}^{10} wq^2 ≤ C   (a ‘soft’ budget constraint)

Fitting the data
Minimize Ein(w) subject to the soft budget constraint; the minimizer is the regularized weight vector wreg.
[Figure: the regularized fit to the data.]

Getting wreg
min:  Ein(w) = (1/N)(Zw − y)t(Zw − y)
subject to:  wtw ≤ C

Observations:
The optimal w uses the full budget, i.e. lies on the surface wtw = C; otherwise we could move along the surface and decrease Ein.
At the optimum, ∇Ein is parallel to the normal w of that surface (but in the opposite direction):
∇Ein(wreg) = −2λC wreg
λC, the Lagrange multiplier, is positive. (The 2 is for mathematical convenience.)
[Figure: contours of constant Ein, the surface wtw = C, wlin outside it, and wreg where ∇Ein opposes the normal.]
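A numeric sanity check of these observations (a sketch with made-up data): solve the penalized problem for some λC > 0 and confirm that ∇Ein(wreg) = −2λC wreg.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 6))     # made-up feature matrix
y = rng.standard_normal(50)          # made-up targets
N, lam_C = len(y), 0.3               # any positive multiplier

# Minimizer of Ein(w) + lam_C * w^t w, with Ein(w) = (1/N)(Zw - y)^t (Zw - y):
w_reg = np.linalg.solve(Z.T @ Z + N * lam_C * np.eye(6), Z.T @ y)

grad_Ein = (2.0 / N) * Z.T @ (Z @ w_reg - y)
print(np.allclose(grad_Ein, -2.0 * lam_C * w_reg))   # True: antiparallel to the normal
```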
Unconstrained minimization
Since ∇Ein(wreg) + 2λC wreg = 0, wreg also solves the unconstrained problem
minimize:  Ein(w) + λC wtw.
The constrained fit (budget C) and the penalized fit (multiplier λC) are one and the same.

Augmented error
Eaug(w) = Ein(w) + λC wtw
The second term is a penalty for the ‘complexity’ of h, measured by the size of the weights.
We can pick any budget C; translation: we are free to pick any multiplier λC.
What’s the right C? ↔ What’s the right λC?

Linear regression
Convenient to set λC = λ/N:
Eaug(w) = Ein(w) + (λ/N) wtw
This is called ‘weight decay’, as the penalty encourages smaller weights.

Linear regression solution
Minimizing Eaug gives  wreg = (ZtZ + λI)^{−1}Zty;  λ determines the amount of regularization.
Recall the unconstrained solution (λ = 0):  wlin = (ZtZ)^{−1}Zty.
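A minimal sketch of the closed-form solution (the function name is mine):

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Solve (Z^t Z + lam * I) w = Z^t y, the weight-decay minimizer."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# The weight budget w^t w shrinks as lambda grows:
rng = np.random.default_rng(2)
Z, y = rng.standard_normal((30, 5)), rng.standard_normal(30)
for lam in (0.0, 0.1, 1.0, 10.0):
    w = weight_decay_fit(Z, y, lam)
    print(lam, round(float(w @ w), 3))
```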

Dramatic effect
[Figure: data, target, and fit; regularization has a dramatic effect on the fit.]

Just a little works
[Figures: the fit without regularization vs. with just a little regularization.]

Easy to overdose
[Figures: fits with increasing λ, moving from overfitting (λ too small) to underfitting (λ too large).]

Overfitting and underfitting
[Plot: expected Eout vs. the regularization parameter λ; overfitting at small λ, underfitting at large λ, with a minimum in between. A sweep like the sketch below produces such a curve.]
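A sketch of how such a curve can be generated (the target, noise level, and model order are my own choices, not the slides’ experiment): sweep λ, fit with weight decay, and estimate Eout on a large test set.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)               # assumed target
Q = 10                                        # model order

x_tr = rng.uniform(-1, 1, 15)                 # small noisy training set
y_tr = f(x_tr) + 0.5 * rng.standard_normal(15)
x_te = rng.uniform(-1, 1, 2000)               # large set to estimate Eout

Z_tr = np.vander(x_tr, Q + 1, increasing=True)
Z_te = np.vander(x_te, Q + 1, increasing=True)

for lam in (0.0, 0.01, 0.1, 1.0, 10.0):
    w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(Q + 1), Z_tr.T @ y_tr)
    print(f"lambda={lam:<5}  Eout ~ {np.mean((Z_te @ w - f(x_te)) ** 2):.3f}")
```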

Noise and regularization
[Plot: expected Eout vs. λ for stochastic noise levels σ² = 0, 0.25, 0.5; more noise warrants more regularization.]

Deterministic too
[Plots: expected Eout vs. λ for stochastic noise (σ² = 0, 0.25, 0.5) and for deterministic noise (target complexity Qf = 15, 30, 100); both kinds of noise behave similarly.]

Variations on weight decay
A more general penalty emphasizes some weights over others: Ω(w) = Σ_{q=0}^{Q} γq wq^2.
[Plots: expected Eout vs. λ for ‘weight decay’ and ‘weight growth’ penalties; weight decay helps, weight growth does not.]
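A sketch of the general diagonal penalty; the particular γq schedules below are my illustrations of emphasizing low-order vs. high-order weights, not necessarily the slides’ exact curves.

```python
import numpy as np

def fit_with_penalty(Z, y, lam, gamma):
    """Minimize (1/N)|Zw - y|^2 + (lam/N) * sum_q gamma[q] * w[q]^2."""
    return np.linalg.solve(Z.T @ Z + lam * np.diag(gamma), Z.T @ y)

Q = 5
gamma_low_order  = 2.0 ** np.arange(Q + 1)    # punish high-order weights more
gamma_high_order = 2.0 ** -np.arange(Q + 1)   # punish low-order weights more

rng = np.random.default_rng(5)
Z, y = rng.standard_normal((30, Q + 1)), rng.standard_normal(30)
w = fit_with_penalty(Z, y, 1.0, gamma_low_order)
```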

Choosing a regularizer
Ideally, constrain in the ‘direction’ of the target function. But the target function is unknown (we’re going around in circles).
Instead, constrain in the ‘direction’ of smoother (usually simpler) hypotheses: this hurts your ability to fit the ‘high frequency’ noise.
Smoother and simpler usually means weight decay, not weight growth.
Even with an imperfect regularizer, you still have λ to play with: validation (see the sketch below).
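A sketch of picking λ by validation (a simple hold-out split; the names are mine):

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def choose_lambda(Z, y, lambdas, rng, val_frac=0.25):
    """Hold out a validation set and pick the lambda with smallest Eval."""
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    return min(
        lambdas,
        key=lambda lam: np.mean(
            (Z[val] @ weight_decay_fit(Z[tr], y[tr], lam) - y[val]) ** 2
        ),
    )

rng = np.random.default_rng(4)
Z, y = rng.standard_normal((40, 6)), rng.standard_normal(40)
print(choose_lambda(Z, y, [0.01, 0.1, 1.0, 10.0], rng))
```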

Regularization philosophy
Regularization helps to combat what noise remains, especially when N is small.
Typical modus operandi: sacrifice a little bias for a huge improvement in var.
VC angle: you are using a smaller H without sacrificing too much Ein.

Eaug versus Ein
Eaug(h) = Ein(h) + (λ/N) Ω(h)
For weight decay, the penalty Ω(h) is wtw.
Compare with the VC bound Eout(h) ≤ Ein(h) + Ω(H), where Ω(H) was the model complexity penalty from VC theory.
How well Eaug tracks Eout depends on the choice of λ.