Gradient Descent for L2 Penalized Logistic Regression

Mike Hughes - Tufts COMP 135 - Spring 2019


SLIDE 1

input: initial weights w ∈ R^M
input: initial step size s_0 ∈ R^+
t ← 0
while not converged:
    w ← w − s_t · ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1


You need to specify:

  • Max number of iterations T
  • Step size s
  • Convergence threshold δ

\[
\min_{w \in \mathbb{R}^M} \;\; \underbrace{\frac{1}{2}\lambda\, w^T w \;-\; \sum_{n=1}^{N} \log \mathrm{BernPMF}\!\big(t_n \,\big|\, \sigma(w^T \phi(x_n))\big)}_{L(w)}
\]

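To make slide 1 concrete, here is a minimal Python sketch of the loop above applied to the L2-penalized logistic regression loss L(w). The helper names (calc_loss, calc_grad, gradient_descent) and the constant step size are illustrative assumptions, not the course's reference code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def calc_loss(w, Phi, t, lam):
        # L(w) = 0.5 * lam * w^T w - sum_n log BernPMF(t_n | sigmoid(w^T phi(x_n)))
        z = Phi @ w
        # log sigmoid(z) = -logaddexp(0, -z);  log(1 - sigmoid(z)) = -logaddexp(0, z)
        log_lik = -np.sum(t * np.logaddexp(0.0, -z) + (1 - t) * np.logaddexp(0.0, z))
        return 0.5 * lam * (w @ w) - log_lik

    def calc_grad(w, Phi, t, lam):
        # Gradient of L(w): lam * w + Phi^T (sigmoid(Phi w) - t)
        return lam * w + Phi.T @ (sigmoid(Phi @ w) - t)

    def gradient_descent(Phi, t, lam=1.0, s0=0.1, max_iters=1000, tol=1e-6):
        w = np.zeros(Phi.shape[1])          # initial w
        for _ in range(max_iters):          # max number of iterations T
            g = calc_grad(w, Phi, t, lam)
            if np.linalg.norm(g) < tol:     # convergence threshold
                break
            w = w - s0 * g                  # fixed step; decay schedules come later
        return w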

SLIDE 2

Will gradient descent always find same solution?

SLIDE 3

Will gradient descent always find same solution?


Yes, if the loss looks like a single bowl (one global minimum); not if multiple local minima exist.

SLIDE 4

Loss for logistic regression is convex!
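One way to see this (a sketch in the notation of slide 1, writing p_n = σ(wᵀφ(x_n))): the Hessian of L(w) is

\[
\nabla_w^2 L(w) \;=\; \lambda I \;+\; \sum_{n=1}^{N} p_n (1 - p_n)\, \phi(x_n)\, \phi(x_n)^T
\]

Each outer-product term is positive semidefinite and p_n(1 − p_n) ≥ 0, so for λ > 0 the Hessian is positive definite and L(w) is strictly convex: gradient descent has a single global minimum to find.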

SLIDE 5

Intuition: 1D gradient descent


[Figure: two plots of a 1D function g(y) versus y, showing gradient descent steps]

Choosing good step size matters!
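A minimal 1D sketch of that intuition in Python; the quadratic g(y) = y² and the particular step sizes are illustrative assumptions:

    def gradient_descent_1d(grad_g, y0, step, num_iters=20):
        # Plain 1D gradient descent: y <- y - step * g'(y)
        y = y0
        for _ in range(num_iters):
            y = y - step * grad_g(y)
        return y

    grad_g = lambda y: 2.0 * y                            # g(y) = y^2, minimum at y = 0
    print(gradient_descent_1d(grad_g, y0=5.0, step=0.1))  # small step: converges toward 0
    print(gradient_descent_1d(grad_g, y0=5.0, step=1.1))  # too-large step: oscillates and diverges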

SLIDE 6

Log likelihood vs iterations


Maximizing likelihood: Higher is better! (could multiply by -1 and minimize instead)

Figure Credit: Emily Fox (UW)

SLIDE 7

If step size is too small


Figure Credit: Emily Fox (UW)

SLIDE 8

If step size is large


Figure Credit: Emily Fox (UW)

SLIDE 9

If step size is too large


Figure Credit: Emily Fox (UW)

SLIDE 10

If step size is way too large


Figure Credit: Emily Fox (UW)

SLIDE 11

Rule for picking step sizes

  • Never try just one!
  • Usually: want the largest step size that doesn’t diverge
  • Try several values (exponentially spaced; see the sketch after this list) until you:
    • Find one clearly too small
    • Find one clearly too large (unhelpful oscillation / divergence)
  • Always make trace plots!
    • Show the loss, norm of the gradient, and parameter values versus epoch
  • Smarter choices for step size:
    • Decaying methods
    • Search methods
    • Second-order methods
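A minimal sketch of such a sweep, reusing the gradient_descent and calc_loss helpers assumed on slide 1; the particular range 10⁻⁵ to 10¹ is an illustrative assumption:

    import numpy as np

    # Phi, t: feature matrix and labels from the slide 1 sketch
    lam = 1.0
    for s0 in np.logspace(-5, 1, num=7):   # exponentially spaced step sizes
        w = gradient_descent(Phi, t, lam=lam, s0=s0)
        print(f"s0={s0:.0e}  loss={calc_loss(w, Phi, t, lam):.3f}")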


SLIDE 12

Decaying step sizes


input: initial weights w ∈ R^M
input: initial step size s_0 ∈ R^+
while not converged:
    w ← w − s_t · ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1

Exponential decay: s_t = s_0 · e^(−k t)
Linear decay: s_t = s_0 / (k t)

Often helpful, but hard to get right!
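A minimal sketch of the two schedules in Python; the decay rate k = 0.01 is an illustrative assumption:

    import numpy as np

    def decay_exponential(s0, t, k=0.01):
        # Exponential decay: s_t = s0 * exp(-k t); t is the iteration counter
        return s0 * np.exp(-k * t)

    def decay_linear(s0, t, k=0.01):
        # Linear decay: s_t = s0 / (k t), guarding the t = 0 case
        return s0 if t == 0 else s0 / (k * t)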

SLIDE 13

Searching for good step size

Search for the best scalar s >= 0, such that:


Goal: min_x f(x)
Step direction: ∆x = −∇_x f(x)
Exact line search: s* = argmin over s ≥ 0 of f(x + s·∆x)

Expensive, but the gold standard. (The possible step lengths trace out the ray x + s·∆x.)
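A minimal sketch of exact line search using SciPy's bounded scalar minimizer; the upper bound s_max on the step length is an illustrative assumption:

    from scipy.optimize import minimize_scalar

    def exact_line_search(f, x, dx, s_max=10.0):
        # s* = argmin over 0 <= s <= s_max of f(x + s * dx)
        result = minimize_scalar(lambda s: f(x + s * dx),
                                 bounds=(0.0, s_max), method='bounded')
        return result.x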

SLIDE 14

Searching for good step size


Goal: min_x f(x)
Step direction: ∆x = −∇_x f(x)
Backtracking line search: more efficient!

    s ← 1
    while f̂(x + s·∆x) < f(x + s·∆x):    (f̂ = linear extrapolation with reduced slope)
        s ← 0.9 · s

SLIDE 15

Backtracking line search


[Figure: f along the ray x + s·∆x, marking acceptable vs. rejected step sizes against a linear extrapolation whose slope is reduced by factor α]

    s ← 1
    while f̂(x + s·∆x) < f(x + s·∆x):    (f̂ = linear extrapolation with slope reduced by α)
        s ← 0.9 · s

Python : scipy.optimize.line_search
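A minimal Python sketch of the backtracking loop above, written as the equivalent sufficient-decrease (Armijo) test; the reduced-slope factor alpha = 0.5 is an illustrative assumption, while the shrink factor 0.9 comes from the slide:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, alpha=0.5, shrink=0.9):
        g = grad_f(x)
        dx = -g                                  # descent direction
        reduced_slope = alpha * np.dot(g, dx)    # slope of the reduced-slope extrapolation
        s = 1.0
        # Shrink s until f(x + s*dx) drops below the linear extrapolation
        while f(x + s * dx) > f(x) + s * reduced_slope:
            s *= shrink
        return s

scipy.optimize.line_search provides a production version of the same idea (it enforces the stronger Wolfe conditions).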

SLIDE 16

More resources on step sizes!

Online textbook: Convex Optimization (Boyd & Vandenberghe)

http://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf


SLIDE 17

2nd-order methods for gradient descent

Big idea: the 2nd derivative can help!


SLIDE 18

Newton’s method: Use second-derivative to rescale step size!


Goal: min_x f(x)
Step direction: ∆x = −H(x)⁻¹ · ∇_x f(x)

Will step directly to the minimum if f is quadratic!

In high dimensions, need the Hessian matrix
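A minimal sketch of one Newton step in Python; solving the linear system H·∆x = −∇f (rather than explicitly inverting H) is standard practice, and the helper names are illustrative assumptions:

    import numpy as np

    def newton_step(grad_f, hess_f, x):
        # dx = -H(x)^{-1} grad f(x), computed via a linear solve
        g = grad_f(x)
        H = hess_f(x)
        dx = np.linalg.solve(H, -g)
        return x + dx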

SLIDE 19

Animation of Newton’s method


To optimize, we want to find zeros of first derivative!

[Figure: animation of Newton's method locating a zero of f′(x)]

SLIDE 20

L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno

  • Provide loss and gradient functions
  • Approximates the Hessian via recent history of gradient steps

L-BFGS: gold standard approximate 2nd order GD


Exact Newton step: ∆x = −H(x)⁻¹ · ∇_x f(x)

In high dimensions this needs the full Hessian matrix, whose size is quadratic in the length of x: expensive!

L-BFGS step: ∆x = −Ĥ(x)⁻¹ · ∇_x f(x)

Instead, use a low-rank approximation Ĥ built from the recent history of gradient steps.

Python : scipy.optimize.fmin_l_bfgs_b
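A minimal usage sketch, reusing the calc_loss and calc_grad helpers assumed on slide 1 (Phi, t, and lam = 1.0 are illustrative):

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b

    w0 = np.zeros(Phi.shape[1])       # initial weights
    w_opt, loss_opt, info = fmin_l_bfgs_b(
        func=calc_loss,               # loss function L(w)
        x0=w0,
        fprime=calc_grad,             # gradient of L(w)
        args=(Phi, t, 1.0),           # extra args: features, labels, lam
    )
    print(loss_opt, info['funcalls'])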