

SLIDE 1

Logistic Regression


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to:
  • Prof. Mike Hughes
  • Erik Sudderth (UCI)
  • Finale Doshi-Velez (Harvard)
  • James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Objectives Today: Logistic Regression

  • View as a probabilistic classifier
  • Justification: minimizing "log loss" is equivalent to maximizing the likelihood of the training set
  • Computing log loss in a numerically stable way
  • Computing the gradient wrt parameters
  • Training via gradient descent


SLIDE 3

Check-in Q1:


When training Logistic Regression, we minimize the log loss on the training set.

Can you provide 2 justifications for why this log loss objective is sensible?

$$\min_{w,b} \sum_{n=1}^{N} \mathrm{log\_loss}(y_n, \hat{p}(x_n, w, b))$$

$$\mathrm{log\_loss}(y, \hat{p}) = -y \log \hat{p} - (1 - y) \log(1 - \hat{p})$$

SLIDE 4

Check-in Q1:


When training Logistic Regression, we minimize the log loss on the training set.

Can you provide 2 justifications for why this log loss objective is sensible?

1) Log loss is an upper bound on the error rate. Minimizing log loss must reduce our (worst case) error rate.
2) Log loss = binary cross entropy. Interpret it as learning the best probabilistic "encoding" of the training data.

$$\min_{w,b} \sum_{n=1}^{N} \mathrm{log\_loss}(y_n, \hat{p}(x_n, w, b))$$

$$\mathrm{log\_loss}(y, \hat{p}) = -y \log \hat{p} - (1 - y) \log(1 - \hat{p})$$
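As a concrete reference, here is a minimal NumPy sketch of the per-example log loss defined above (the clipping constant and example values are assumptions for illustration, not from the slides):

import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    # Binary log loss (cross entropy) for label y in {0, 1} and predicted probability p_hat
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

# Confident correct predictions get small loss; confident wrong ones get large loss
print(log_loss(1, 0.99))   # ~0.01
print(log_loss(1, 0.01))   # ~4.6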

SLIDE 5

What will we learn?


Paradigms: Supervised Learning, Unsupervised Learning, Reinforcement Learning

Supervised learning: data, label pairs $\{x_n, y_n\}_{n=1}^{N}$; a task (map data x to label y); a performance measure; the workflow Training, Prediction, Evaluation.

SLIDE 6


Task: Binary Classification (supervised learning)

y is a binary variable (red or blue); the plot shows examples in the (x_1, x_2) feature plane.

SLIDE 7

>>> yproba_N2 = model.predict_proba(x_NF)
>>> yproba1_N = model.predict_proba(x_NF)[:, 1]
>>> yproba1_N[:5]
[0.143, 0.432, 0.523, 0.003, 0.994]

Probability Prediction

Goal: Predict probability of label given features

  • Input: "features" / "covariates" / "attributes"
    $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: "probability"
    $\hat{p}_i \triangleq p(Y_i = 1 \mid x_i)$
    Value between 0 and 1, e.g. 0.001, 0.513, 0.987
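For context, a rough sketch of how such a model might be fit and queried with scikit-learn; the toy arrays x_NF and y_N are assumptions, only the predict_proba usage comes from the slide:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: N=4 examples, F=2 features
x_NF = np.array([[0.1, 2.0], [1.5, -0.3], [2.2, 0.8], [-1.0, -2.0]])
y_N = np.array([0, 1, 1, 0])

model = LogisticRegression()
model.fit(x_NF, y_N)

yproba_N2 = model.predict_proba(x_NF)   # shape (N, 2): columns are P(y=0|x), P(y=1|x)
yproba1_N = yproba_N2[:, 1]             # probability of the positive class only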

SLIDE 8

Review: Logistic Regression


Parameters:
  weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$
  bias scalar $b$

Prediction:
  $\hat{p}(x_i, w, b) \triangleq p(y_i = 1 \mid x_i) = \mathrm{sigmoid}\Big(\sum_{f=1}^{F} w_f x_{if} + b\Big)$, where $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$

Training:
  Find weights and bias that minimize (penalized) log loss
  $\min_{w,b} \sum_{n=1}^{N} \mathrm{log\_loss}(y_n, \hat{p}(x_n, w, b))$
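A minimal NumPy sketch of this prediction rule (the array names follow the course's shape-suffix convention; the function itself is an illustration, not the course's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x_NF, w_F, b):
    # Predicted probability that y = 1 for each of N examples with F features
    scores_N = x_NF @ w_F + b   # linear score per example
    return sigmoid(scores_N)    # squash each score into (0, 1)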

SLIDE 9

Logistic Regression: Predicted Probas vs Binary Decisions


  • The predicted probability function is monotonically increasing in one direction
  • That direction is perpendicular to the decision boundary
  • The decision boundary is the set of x values where Pr(y=1|x) = 0.5
  • The decision boundary is a linear function of x (see the short derivation below)
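To make the last point explicit, the boundary follows directly from the sigmoid definition:

$\Pr(y = 1 \mid x) = 0.5 \iff \mathrm{sigmoid}(w^T x + b) = 0.5 \iff w^T x + b = 0,$

which is a linear equation in x (a line in 2D, a hyperplane in general).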

SLIDE 10

Check-in (warm up for lab)

Consider a logistic regression classifier for 2D features. What is the value (approximately) of w_1, w_2, and the bias for each plot below?

[Plots: predicted probability of the positive class over the (x_1, x_2) plane]

SLIDE 11

Check-in Answered

Consider a logistic regression classifier for 2D features. What is the value (approximately) of w_1, w_2, and the bias for each plot below?

[Plots: predicted probability of the positive class over the (x_1, x_2) plane]

SLIDE 12


Optimization Objective: Why minimize log loss?

A probabilistic justification

SLIDE 13

Likelihood of labels under LR


We can write the probability for each possible outcome as:

$p(Y_i = 1 \mid x_i) = \sigma(w^T x_i + b)$
$p(Y_i = 0 \mid x_i) = 1 - \sigma(w^T x_i + b)$

We can write the probability mass function of the random variable $Y_i$ as:

$p(Y_i = y_i \mid x_i) = \big[\sigma(w^T x_i + b)\big]^{y_i} \big[1 - \sigma(w^T x_i + b)\big]^{1 - y_i}$

Interpret: p(y | x) is the "likelihood" of label y given input features x.
Goal: Fit the model to make the training data as likely as possible.

SLIDE 14

Maximizing likelihood


$$\max_{w,b} \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)$$

Why might this be hard in practice? Hint: think about datasets with 1000s of examples (large N).
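A quick illustration of the numerical issue, sketched in NumPy (the per-example probability of 0.1 is an arbitrary assumption for demonstration):

import numpy as np

probs_N = np.full(1000, 0.1)      # 1000 per-example likelihoods, each 0.1
print(np.prod(probs_N))           # 0.0 -- the product underflows in float64
print(np.sum(np.log(probs_N)))    # about -2302.6 -- the log likelihood stays representable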

SLIDE 15

Maximizing log likelihood


The logarithm (with any base) is a monotonic transform

a > b implies log (a) > log (b)

Thus, the following are equivalent problems

$$w^*, b^* = \arg\max_{w,b} \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)$$

$$w^*, b^* = \arg\max_{w,b} \log \left[ \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b) \right]$$

SLIDE 16

Log likelihood for LR


We can write the probability mass function of Y as shown below. Our training objective is to maximize the log likelihood.

$$p(Y_i = y_i \mid x_i) = \big[\sigma(w^T x_i + b)\big]^{y_i} \big[1 - \sigma(w^T x_i + b)\big]^{1 - y_i}$$

$$w^*, b^* = \arg\max_{w,b} \log \left[ \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b) \right]$$

SLIDE 17

In order to maximize likelihood, we can minimize negative log likelihood.

Two equivalent optimization problems:


$$w^*, b^* = \arg\max_{w,b} \sum_{n=1}^{N} \log p(Y_n = y_n \mid x_n, w, b)$$

$$w^*, b^* = \arg\min_{w,b} \; -\sum_{n=1}^{N} \log p(Y_n = y_n \mid x_n, w, b)$$

SLIDE 18

Summary of “likelihood” justification for the way we train logistic regression

  • We defined a probabilistic model for y given x
  • We want to maximize the probability of the training data under this model ("maximize likelihood")
  • We can show that another optimization problem ("maximize log likelihood") is easier numerically but produces the same optimal values for weights and bias
  • Equivalent to minimizing -1 * log likelihood
  • Turns out, minimizing log loss is precisely the same thing as minimizing negative log likelihood


SLIDE 19

Computing the loss for Logistic Regression (LR) in a numerically stable way


SLIDE 20

Simplified notation

  • Feature vector with first entry constant: $\tilde{x}_n = [1, x_{n1}, x_{n2}, \ldots, x_{nF}]$
  • Weight vector (first entry is the "bias"): $w = [w_0, w_1, w_2, \ldots, w_F]$
  • "Score" value s (real number, $-\infty$ to $+\infty$): $s_n = w^T \tilde{x}_n$ (see the sketch below)
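A small NumPy sketch of this notation; the toy numbers are assumptions, only the shapes and the computation of s_n = w^T x̃_n reflect the slide:

import numpy as np

x_NF = np.array([[0.1, 2.0], [1.5, -0.3]])       # N=2 examples, F=2 features
ones_N1 = np.ones((x_NF.shape[0], 1))
xtilde_NG = np.hstack([ones_N1, x_NF])           # prepend constant 1; G = F + 1

w_G = np.array([0.5, -1.0, 2.0])                 # [w_0 (bias), w_1, w_2]
scores_N = xtilde_NG @ w_G                       # s_n = w^T x_tilde_n, one score per example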
SLIDE 21

Training Logistic Regression


$$\min_{w} \sum_{n=1}^{N} \mathrm{BCE}(y_n, \sigma(s_n(w)))$$

where $\sigma(s) = \mathrm{sigmoid}(s)$, $s_n = w^T \tilde{x}_n$, and $w = [w_0, w_1, \ldots, w_F]$.

How can we evaluate this loss as a function of w reliably?
How can we evaluate gradients of this loss with respect to w?

SLIDE 22

Simplifying per-example loss


$\mathrm{BCE}(y_n, \sigma(s_n)) = -y_n \log_2 \sigma(s_n) - (1 - y_n) \log_2 (1 - \sigma(s_n))$

$\quad = -y_n \log_2 \dfrac{1}{1 + e^{-s_n}} - (1 - y_n) \log_2 \dfrac{e^{-s_n}}{1 + e^{-s_n}}$

$\quad = \begin{cases} \log_2(1 + e^{-s_n}) & \text{if } y_n = 1 \\ \log_2(1 + e^{s_n}) & \text{if } y_n = 0 \end{cases}$

$\quad = \log_2\big(1 + e^{\mathrm{flip}(y_n)\, s_n}\big) = \log_2\big(e^{0} + e^{\mathrm{flip}(y_n)\, s_n}\big)$

where $s_n = w^T \tilde{x}_n$ and $\mathrm{flip}(y_n) = +1$ when $y_n = 0$ and $-1$ when $y_n = 1$.
SLIDE 23

Why care about numerical issues?

1) Avoid NaNs and other issues (illustrated below)
2) Correctly rank bad solutions (avoid saturation)
3) Non-zero gradients everywhere (no flat regions)
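A quick illustration of issue (1), using a deliberately naive implementation (a sketch for intuition, not the course's code):

import numpy as np

def naive_log_loss(y, s):
    p = 1.0 / (1.0 + np.exp(-s))                     # sigmoid; exp(-s) overflows for very negative s
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

with np.errstate(all="ignore"):
    print(naive_log_loss(1, -800.0))   # inf: p underflows to 0, so log(p) = -inf
    print(naive_log_loss(0, 800.0))    # inf: 1 - p rounds to 0, so log(1 - p) = -inf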


SLIDE 24

logsumexp : the problem


See HW2 on the website.
SLIDE 25

logsumexp : the solution


See HW2 on the website.
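The slides defer the details to HW2; as a rough sketch of the standard logsumexp trick (this uses natural log for simplicity and is an assumed illustration, not the HW2 solution):

import numpy as np

def logsumexp(a):
    # log(sum(exp(a))) computed without overflow by factoring out the max
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def stable_log_loss(y, s):
    # Per-example loss log(e^0 + e^(flip(y)*s)) evaluated as logsumexp([0, flip(y)*s])
    flip = 1.0 if y == 0 else -1.0
    return logsumexp(np.array([0.0, flip * s]))

print(stable_log_loss(1, 800.0))    # ~0, no overflow
print(stable_log_loss(1, -800.0))   # ~800, instead of inf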
SLIDE 26

Gradient of log loss wrt w


Per-example loss when y = 0 (with $s_n = w^T \tilde{x}_n$):

$$J(s_n(w)) = \log(1 + e^{s_n})$$

Gradient w.r.t. the weight on feature f: use the chain rule

$$\frac{\partial}{\partial w_f} J(s_n(w)) = \frac{\partial J}{\partial s_n} \cdot \frac{\partial s_n}{\partial w_f}$$

Simplifying yields:

$$\frac{\partial}{\partial w_f} J = \big(\sigma(s_n(w)) - y_n\big)\, \tilde{x}_{nf}$$

SLIDE 27

Gradient of log loss wrt weights


$$s_n = w^T \tilde{x}_n, \qquad \frac{\partial}{\partial w_f} J = \big(\sigma(s_n(w)) - y_n\big)\, \tilde{x}_{nf}$$

Nice interpretation: (PREDICTION - TRUE RESPONSE) * FEATURE

If the feature is positive: a predicted probability larger than the truth means the gradient is positive, so to improve our model (go downhill) we should make that weight smaller. If the feature is negative, we move the weight in the opposite direction.
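In vectorized form, the gradient over the whole training set can be sketched as follows (NumPy; the array names and shape-suffix convention are assumptions):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_log_loss(w_G, xtilde_NG, y_N):
    # Gradient of the summed log loss wrt weights: sum_n (sigma(s_n) - y_n) * xtilde_n
    s_N = xtilde_NG @ w_G
    return xtilde_NG.T @ (sigmoid(s_N) - y_N)    # shape (G,), one entry per weight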

SLIDE 28

Gradient descent for Logistic Regression

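This slide introduces the training loop; a rough sketch of plain (full-batch) gradient descent for this objective, where the step size and iteration count are arbitrary assumptions:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic_regression(xtilde_NG, y_N, step_size=0.1, n_iters=1000):
    # Minimize the summed log loss by repeatedly stepping opposite the gradient
    w_G = np.zeros(xtilde_NG.shape[1])
    for _ in range(n_iters):
        grad_G = xtilde_NG.T @ (sigmoid(xtilde_NG @ w_G) - y_N)
        w_G = w_G - step_size * grad_G
    return w_G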

SLIDE 29

Will gradient descent always find same solution?


Yes, if the loss is convex (a single bowl-shaped minimum, as pictured). Not if multiple local minima exist.

SLIDE 30

Log loss (aka BCE) is convex as a function of weight parameters


SLIDE 31


Objectives Today: Logistic Regression

  • View as a probabilistic classifier
  • Justification: minimizing "log loss" is equivalent to maximizing the likelihood of the training set
  • Computing log loss in a numerically stable way
  • Computing the gradient
  • Training via gradient descent

Lab: practice visual intuition for how weights and bias affect predicted probabilities