Gradient Descent for L2 Penalized Logistic Regression

Mike Hughes - Tufts COMP 135 - Spring 2019


SLIDE 1

input: initial weights w ∈ R^M
input: initial step size s_0 ∈ R^+
t ← 0
while not converged:
    w ← w − s_t · ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1


You need to specify:

  • Max number of iterations T
  • Step size s
  • Convergence threshold δ

\[
\min_{w \in \mathbb{R}^M} \;\; \underbrace{\frac{1}{2}\lambda\, w^T w \;-\; \sum_{n=1}^{N} \log \mathrm{BernPMF}\!\big(t_n \,\big|\, \sigma(w^T \phi(x_n))\big)}_{L(w)}
\]

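To make slide 1 concrete, here is a minimal Python sketch of the loop above applied to the L2-penalized logistic regression loss L(w). The helper names (calc_loss, calc_grad, gradient_descent) and the constant step size are illustrative assumptions, not the course's reference code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def calc_loss(w, Phi, t, lam):
        # L(w) = 0.5 * lam * w^T w - sum_n log BernPMF(t_n | sigmoid(w^T phi(x_n)))
        z = Phi @ w
        # log sigmoid(z) = -logaddexp(0, -z);  log(1 - sigmoid(z)) = -logaddexp(0, z)
        log_lik = -np.sum(t * np.logaddexp(0.0, -z) + (1 - t) * np.logaddexp(0.0, z))
        return 0.5 * lam * (w @ w) - log_lik

    def calc_grad(w, Phi, t, lam):
        # Gradient of L(w): lam * w + Phi^T (sigmoid(Phi w) - t)
        return lam * w + Phi.T @ (sigmoid(Phi @ w) - t)

    def gradient_descent(Phi, t, lam=1.0, s0=0.1, max_iters=1000, tol=1e-6):
        w = np.zeros(Phi.shape[1])          # initial w
        for _ in range(max_iters):          # max number of iterations T
            g = calc_grad(w, Phi, t, lam)
            if np.linalg.norm(g) < tol:     # convergence threshold
                break
            w = w - s0 * g                  # fixed step; decay schedules come later
        return w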

SLIDE 2

Will gradient descent always find same solution?

SLIDE 3

Will gradient descent always find same solution?


Yes, if the loss looks like a single bowl (one global minimum); not if multiple local minima exist.

SLIDE 4

Loss for logistic regression is convex!
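One way to see this (a sketch in the notation of slide 1, writing p_n = σ(wᵀφ(x_n))): the Hessian of L(w) is

\[
\nabla_w^2 L(w) \;=\; \lambda I \;+\; \sum_{n=1}^{N} p_n (1 - p_n)\, \phi(x_n)\, \phi(x_n)^T
\]

Each outer-product term is positive semidefinite and p_n(1 − p_n) ≥ 0, so for λ > 0 the Hessian is positive definite and L(w) is strictly convex: gradient descent has a single global minimum to find.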

SLIDE 5

Intuition: 1D gradient descent


[Figure: two plots of a 1D function g(y) versus y, showing gradient descent steps]

Choosing good step size matters!
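A minimal 1D sketch of that intuition in Python; the quadratic g(y) = y² and the particular step sizes are illustrative assumptions:

    def gradient_descent_1d(grad_g, y0, step, num_iters=20):
        # Plain 1D gradient descent: y <- y - step * g'(y)
        y = y0
        for _ in range(num_iters):
            y = y - step * grad_g(y)
        return y

    grad_g = lambda y: 2.0 * y                            # g(y) = y^2, minimum at y = 0
    print(gradient_descent_1d(grad_g, y0=5.0, step=0.1))  # small step: converges toward 0
    print(gradient_descent_1d(grad_g, y0=5.0, step=1.1))  # too-large step: oscillates and diverges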

SLIDE 6

Log likelihood vs iterations


Maximizing likelihood: Higher is better! (could multiply by -1 and minimize instead)

Figure Credit: Emily Fox (UW)

SLIDE 7

If step size is too small


Figure Credit: Emily Fox (UW)

SLIDE 8

If step size is large


Figure Credit: Emily Fox (UW)

SLIDE 9

If step size is too large


Figure Credit: Emily Fox (UW)

SLIDE 10

If step size is way too large


Figure Credit: Emily Fox (UW)

SLIDE 11

Rule for picking step sizes

  • Never try just one!
  • Usually: want the largest step size that doesn’t diverge
  • Try several values (exponentially spaced; see the sketch after this list) until you:
    • Find one clearly too small
    • Find one clearly too large (unhelpful oscillation / divergence)
  • Always make trace plots!
    • Show the loss, norm of the gradient, and parameter values versus epoch
  • Smarter choices for step size:
    • Decaying methods
    • Search methods
    • Second-order methods
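A minimal sketch of such a sweep, reusing the gradient_descent and calc_loss helpers assumed on slide 1; the particular range 10⁻⁵ to 10¹ is an illustrative assumption:

    import numpy as np

    # Phi, t: feature matrix and labels from the slide 1 sketch
    lam = 1.0
    for s0 in np.logspace(-5, 1, num=7):   # exponentially spaced step sizes
        w = gradient_descent(Phi, t, lam=lam, s0=s0)
        print(f"s0={s0:.0e}  loss={calc_loss(w, Phi, t, lam):.3f}")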


SLIDE 12

Decaying step sizes


input: initial weights w ∈ R^M
input: initial step size s_0 ∈ R^+
while not converged:
    w ← w − s_t · ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1

Exponential decay: s_t = s_0 · e^(−k t)
Linear decay: s_t = s_0 / (k t)

Often helpful, but hard to get right!
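A minimal sketch of the two schedules in Python; the decay rate k = 0.01 is an illustrative assumption:

    import numpy as np

    def decay_exponential(s0, t, k=0.01):
        # Exponential decay: s_t = s0 * exp(-k t); t is the iteration counter
        return s0 * np.exp(-k * t)

    def decay_linear(s0, t, k=0.01):
        # Linear decay: s_t = s0 / (k t), guarding the t = 0 case
        return s0 if t == 0 else s0 / (k * t)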

SLIDE 13

Searching for good step size

Search for the best scalar s >= 0, such that:


Goal: min_x f(x)
Step direction: ∆x = −∇_x f(x)
Exact line search: s* = argmin over s ≥ 0 of f(x + s·∆x)

Expensive, but the gold standard. (The possible step lengths trace out the ray x + s·∆x.)
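A minimal sketch of exact line search using SciPy's bounded scalar minimizer; the upper bound s_max on the step length is an illustrative assumption:

    from scipy.optimize import minimize_scalar

    def exact_line_search(f, x, dx, s_max=10.0):
        # s* = argmin over 0 <= s <= s_max of f(x + s * dx)
        result = minimize_scalar(lambda s: f(x + s * dx),
                                 bounds=(0.0, s_max), method='bounded')
        return result.x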

SLIDE 14

Searching for good step size


Goal: min_x f(x)
Step direction: ∆x = −∇_x f(x)
Backtracking line search: more efficient!

    s ← 1
    while f̂(x + s·∆x) < f(x + s·∆x):    (f̂ = linear extrapolation with reduced slope)
        s ← 0.9 · s

SLIDE 15

Backtracking line search


[Figure: f along the ray x + s·∆x, marking acceptable vs. rejected step sizes against a linear extrapolation whose slope is reduced by factor α]

    s ← 1
    while f̂(x + s·∆x) < f(x + s·∆x):    (f̂ = linear extrapolation with slope reduced by α)
        s ← 0.9 · s

Python : scipy.optimize.line_search
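A minimal Python sketch of the backtracking loop above, written as the equivalent sufficient-decrease (Armijo) test; the reduced-slope factor alpha = 0.5 is an illustrative assumption, while the shrink factor 0.9 comes from the slide:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, alpha=0.5, shrink=0.9):
        g = grad_f(x)
        dx = -g                                  # descent direction
        reduced_slope = alpha * np.dot(g, dx)    # slope of the reduced-slope extrapolation
        s = 1.0
        # Shrink s until f(x + s*dx) drops below the linear extrapolation
        while f(x + s * dx) > f(x) + s * reduced_slope:
            s *= shrink
        return s

scipy.optimize.line_search provides a production version of the same idea (it enforces the stronger Wolfe conditions).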

SLIDE 16

More resources on step sizes!

Online textbook: Convex Optimization (Boyd & Vandenberghe)

http://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf


SLIDE 17

2nd-order methods for gradient descent

Big idea: the 2nd derivative can help!


SLIDE 18

Newton’s method: Use second-derivative to rescale step size!


Goal: min_x f(x)
Step direction: ∆x = −H(x)⁻¹ · ∇_x f(x)

Will step directly to the minimum if f is quadratic!

In high dimensions, need the Hessian matrix
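A minimal sketch of one Newton step in Python; solving the linear system H·∆x = −∇f (rather than explicitly inverting H) is standard practice, and the helper names are illustrative assumptions:

    import numpy as np

    def newton_step(grad_f, hess_f, x):
        # dx = -H(x)^{-1} grad f(x), computed via a linear solve
        g = grad_f(x)
        H = hess_f(x)
        dx = np.linalg.solve(H, -g)
        return x + dx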

SLIDE 19

Animation of Newton’s method


To optimize, we want to find zeros of first derivative!

[Figure: animation of Newton's method locating a zero of f′(x)]

SLIDE 20

L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno

  • Provide loss and gradient functions
  • Approximates the Hessian via recent history of gradient steps

L-BFGS: gold standard approximate 2nd order GD


Exact Newton step: ∆x = −H(x)⁻¹ · ∇_x f(x)

In high dimensions this needs the full Hessian matrix, whose size is quadratic in the length of x: expensive!

L-BFGS step: ∆x = −Ĥ(x)⁻¹ · ∇_x f(x)

Instead, use a low-rank approximation Ĥ built from the recent history of gradient steps.

Python : scipy.optimize.fmin_l_bfgs_b
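A minimal usage sketch, reusing the calc_loss and calc_grad helpers assumed on slide 1 (Phi, t, and lam = 1.0 are illustrative):

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b

    w0 = np.zeros(Phi.shape[1])       # initial weights
    w_opt, loss_opt, info = fmin_l_bfgs_b(
        func=calc_loss,               # loss function L(w)
        x0=w0,
        fprime=calc_grad,             # gradient of L(w)
        args=(Phi, t, 1.0),           # extra args: features, labels, lam
    )
    print(loss_opt, info['funcalls'])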