SLIDE 1

CSCI 447/547 MACHINE LEARNING

Logistic Regression

SLIDE 2

Outline

  • Math Behind Logistic Regression
  • Visualizing Logistic Regression
  • Loss Function

 Minimizing Log Likelihood Function

  • Batch/Full Logistic Regression
  • Gradient Descent for OLS
  • Gradient Descent for Logistic Regression
  • Comparing OLS and Logistic Regression
  • Multi-Class Logistic Regression Using Softmax
SLIDE 3

OLS Recap

  • Linear regression

 Predicts continuous and potentially unbounded labels based on given features

 Y = XB

 B are the coefficients, X is the data matrix

 The Issue:

 Unbounded output means we cannot use it for discrete classification (see the fitting sketch below)
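As an illustration of the OLS setup above, here is a minimal NumPy sketch (synthetic data and coefficient values made up for illustration) that fits B in Y = XB by least squares. Note that the resulting predictions are continuous and unbounded, which is exactly the issue for classification.

```python
import numpy as np

# Hypothetical data: N observations, d features (values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Least-squares solution for B in Y = XB.
B, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ B   # predictions are continuous and unbounded
```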

SLIDE 4

Logistic Regression

  • Classification

 Predicts discrete labels based on given features

  • The Setup

 y = 0 or 1 with probabilities 1-p and p

 Predict a probability instead of a value

 Estimate P(y=1|X)

SLIDE 5

Definitions

  • Link Function

 Relates mean of distribution to output of linear model

 Converts unbounded to bounded predictions

 Converts continuous output to a discrete interpretation

 Typically based on an exponential-family distribution

  • Logistic Function

 Input is -∞ to ∞ and output is 0 to 1

 f: (-∞, ∞) -> (0, 1)

 Typically the sigmoid function

 f(0) = 0.5, f(-∞) = 0, f(∞) = 1

 This value is a probability (see the sigmoid sketch below)
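A minimal sketch of the sigmoid link function described above (NumPy; the function name sigmoid is my own choice), confirming the boundary values f(0) = 0.5, f(-∞) → 0, f(∞) → 1:

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) link: maps (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))      # 0.5
print(sigmoid(-50.0))    # ~0.0
print(sigmoid(50.0))     # ~1.0
```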

SLIDE 6

Note on Choice of Link Function

  • Properties:

 Bounded [0, 1]

 Domain (-∞, ∞)

 Differentiable everywhere

 Used in optimization

 Increasing function

 The sign of each coefficient's effect (positive or negative) is preserved in moving from linear to logistic regression

SLIDE 7

Visualizing Logistic Regression

SLIDE 8

Logistic Regression Loss

  • Our Goal
  • The Full Log-Likelihood
  • A Note on Minimization
SLIDE 9

Our Goal

  • Find P(y=1 | X)

 Probability to be estimated

  • Logistic Function: Φ(t) = 1 / (1 + e^(-t))
  • OLS Function Y = XB becomes Y = Φ(XB)
  • Φ = (σ₁, …, σ_N)
  • y_i = σ(x_i · B)
  • MLE – Maximum Likelihood Estimation
  • L = ∏_{i=1}^{N} P(Y = y_i | x_i)
  • Independent, so the joint probability is the product over the observations
SLIDE 10

Our Goal

  • L = ∏_{i=1}^{N} P(Y = y_i | x_i)
  • = ∏_{i=1}^{N} p_i^(y_i) (1 - p_i)^(1 - y_i)
  • Log-Likelihood, log(L)
  • log(L) = Σ_{i=1}^{N} [y_i log(p_i) + (1 - y_i) log(1 - p_i)]
  • = Σ_{i=1}^{N} [y_i log σ(x_i · B) + (1 - y_i) log(1 - σ(x_i · B))]
  • Since we want to minimize, take the negative log likelihood
  • -log(L) – will minimize this (see the sketch below)
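A minimal sketch of the negative log-likelihood written out above (NumPy; the helper name neg_log_likelihood is my own, assuming a data matrix X, 0/1 labels y, and coefficients B):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_likelihood(B, X, y):
    """-log(L) = -sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ], with p_i = sigmoid(x_i . B)."""
    p = sigmoid(X @ B)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```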
SLIDE 11

The Full Log-Likelihood

  • Minimize this in logistic regression
  • So far, similar to OLS

 But this does not have an explicit (closed-form) solution, so we need to minimize it numerically

SLIDE 12

Logistic Regression Loss

SLIDE 13

Gradient Descent

SLIDE 14

Batch/Full Gradient Descent

  • The Gradient
  • Algorithm

 1. Choose x randomly

 2. Compute the gradient of f at x: ∇f(x)

 3. Step in the direction of the negative of the gradient

 x <- x - η ∇f(x)

 η = step size (too large can overshoot; too small takes too long)

 Repeat steps 2 and 3 until convergence

 Stop when the change between iterations is no longer decreasing or a set number of iterations has been reached (see the sketch below)
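A minimal sketch of the batch gradient descent loop above (Python; the names grad_f, x0, eta, and the tolerance are my own illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, max_iters=1000, tol=1e-8):
    """Generic batch gradient descent: x <- x - eta * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # change between iterations is tiny
            break
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is 0.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
```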

SLIDE 15

Gradient Descent for OLS

  • Numerical Minimization of OLS

 Mean Log-Likelihood

 L = -(1/N) ||y - XB||₂²

  • The Algorithm for OLS (a minimal sketch follows below)
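A minimal sketch of gradient descent for OLS (NumPy; the factor 2/N comes from differentiating the mean squared norm above, and the function name, learning rate, and iteration count are my own illustrative choices):

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, max_iters=5000):
    """Minimize (1/N)||y - XB||^2 by batch gradient descent."""
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(max_iters):
        B = B + (2 * eta / N) * X.T @ (y - X @ B)   # gradient step
    return B
```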
SLIDE 16

Gradient Descent for Logistic Regression

  • Numerical Minimization of Logistic Loss
  • L(B) = -Σ_i [y_i log(Φ(x_i · B)) + (1 - y_i) log(1 - Φ(x_i · B))]

 Two Components

 Recall Φ(t) = 1 / (1 + e^(-t))

 Because this is symmetric about 1/2:

 Φ(t) + Φ(-t) = 1

SLIDE 17

Gradient Descent for Logistic Regression

  • First Term
  • ∂/∂B_k [log(Φ(x_i · B))] = (1/σ(x_i · B)) ∂/∂B_k σ(x_i · B)
  • = (1 + e^(-x_i · B)) ∂/∂B_k [1 / (1 + e^(-x_i · B))]
  • = (1 + e^(-x_i · B)) [1 / (1 + e^(-x_i · B))]² e^(-x_i · B) x_ik
  • = [1 / (1 + e^(x_i · B))] x_ik
  • = σ(-x_i · B) x_ik = (1 - σ(x_i · B)) x_ik
  • In vector form: ∂/∂B [log(Φ(x_i · B))] = (1 - σ(x_i · B)) x_i^T

SLIDE 18

Gradient Descent for Logistic Regression

  • Second Term
  • ∂/∂B [log(1 - Φ(x_i · B))] = ∂/∂B [log(σ(-x_i · B))]
  • = -x_i^T (1 - σ(-x_i · B))   (the first-term result, with -x_i in place of x_i)
  • = -x_i^T σ(x_i · B)

SLIDE 19

Gradient Descent for Logistic Regression

  • All Terms
  • ∂L(B)/∂B = -Σ_i [y_i x_i^T (1 - σ(x_i · B)) + (1 - y_i)(-x_i^T) σ(x_i · B)]
  • = -Σ_i [y_i x_i^T - y_i x_i^T σ(x_i · B) + y_i x_i^T σ(x_i · B) - x_i^T σ(x_i · B)]
  • = -Σ_i x_i^T (y_i - σ(x_i · B))
  • = -X^T (y - Φ(XB))
  • where Φ = (σ₁, …, σ_N)

SLIDE 20

Gradient Descent for Logistic Regression

  • ∂L/∂B = -X^T (y - Φ(XB))

 where Φ applies the individual logistic functions

  • Normalize this by the number of observations

 Divide by N to get the mean loss (see the gradient sketch below)
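A minimal sketch of the mean gradient derived above, -(1/N) X^T (y - Φ(XB)) (NumPy; the function name logistic_gradient is my own):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_gradient(B, X, y):
    """Mean gradient of the negative log-likelihood: -(1/N) X^T (y - sigmoid(XB))."""
    N = X.shape[0]
    return -X.T @ (y - sigmoid(X @ B)) / N
```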

SLIDE 21

Gradient Descent for Logistic Regression

  • Algorithm:

 Pick learning rate η

 Initialize B randomly

 Iterate:

 B ← B - η ∂L/∂B = B + (η/N) X^T [y - Φ(XB)] (see the training-loop sketch below)
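Putting the pieces together, a minimal sketch of the full batch logistic regression training loop (NumPy; the function name, learning rate, iteration count, and random initialization scale are my own illustrative choices):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, max_iters=10000, seed=0):
    """Batch gradient descent: B <- B + (eta/N) X^T (y - sigmoid(XB))."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=0.01, size=d)        # initialize B randomly
    for _ in range(max_iters):
        B = B + (eta / N) * X.T @ (y - sigmoid(X @ B))
    return B

# Predicted probabilities P(y=1|x) for new data X_new would be sigmoid(X_new @ B).
```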

SLIDE 22

Comparing OLS and Logistic Regression

  • The Two Update Steps
  • OLS:

 B ← B + (2η/N) X^T (y - XB)

 constant residual error

  • Logistic:

 B ← B + (η/N) X^T (y - Φ(XB))

 constant error

SLIDE 23

Comparing OLS and Logistic Regression

  • Regularization
  • OLS Loss Function under Regularization:

 L = ||y - XB||₂² + λ²||B||₂²

  • Logistic:

 L = ||y - Φ(XB)||₂² + λ²||B||₂²

  • L2 Norm – Ridge Regression: Uniform Regularization
  • L1 Norm – LASSO Regression: Dimensionality Reduction (see the ridge sketch below)
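A minimal sketch of how an L2 (ridge) penalty changes the gradient step (NumPy). Note this is my own illustration: it adds the penalty's gradient 2λ²B to the log-likelihood gradient from the earlier slides rather than differentiating the squared-error form shown above, and lam stands in for λ.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_ridge(X, y, lam=0.1, eta=0.1, max_iters=10000):
    """Gradient descent with an L2 (ridge) penalty lam^2 * ||B||^2 added to the loss."""
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(max_iters):
        grad = -X.T @ (y - sigmoid(X @ B)) / N + 2 * lam**2 * B   # penalty adds 2*lam^2*B
        B = B - eta * grad
    return B
```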

SLIDE 24

Multi-Class Logistic Regression Using Softmax

  • Softmax

 P(y = k | x_i) = e^(B_k · x_i) / Σ_{l=1}^{n} e^(B_l · x_i),  y ∈ {1, …, n}

 Σ_{k=1}^{n} P(y = k | x_i) = 1, so it is a probability (see the sketch below)
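A minimal sketch of the softmax above (NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(scores):
    """softmax(s)_k = exp(s_k) / sum_l exp(s_l); the outputs sum to 1."""
    shifted = scores - np.max(scores)        # stability: avoid overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # probabilities summing to 1
```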

SLIDE 25

Multi-Class Logistic Regression Using Softmax

  • Comparison with Logistic Regression in the Case of Two Classes

 2 Classes: P(y = 1 | x_i) = 1 / (1 + e^(-x_i · B))

 Softmax with 2 classes:

 P(y = 1 | x_i) = e^(x_i · B_1) / (e^(x_i · B_1) + e^(x_i · B_2))

 Multiply numerator and denominator by e^(-x_i · B_1):

 = 1 / (1 + e^(-x_i · (B_1 - B_2))) (see the numerical check below)
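A quick numerical check of the two-class equivalence above (NumPy; the feature vector x and the two coefficient vectors B1, B2 are made-up illustrative values):

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])
B1 = np.array([0.3, 0.8, -0.1])
B2 = np.array([-0.4, 0.2, 0.5])

# Softmax probability of class 1 with two classes
scores = np.array([x @ B1, x @ B2])
p_softmax = np.exp(scores[0]) / np.exp(scores).sum()

# Logistic form with coefficients B1 - B2
p_logistic = 1.0 / (1.0 + np.exp(-x @ (B1 - B2)))

print(np.isclose(p_softmax, p_logistic))     # True
```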

SLIDE 26

Softmax Optimization

  • Classification Probabilities:

 Π_{i,k} = e^(x_i · B_k) / Σ_l e^(x_i · B_l),  π_k = (Π_{1,k}, …, Π_{N,k})

 Probability of observation i belonging to class k

  • The Gradients:

 ∂L/∂B_k = -X^T (y_k - π_k)

 Logistic: -X^T (y - σ(XB))

 Probability vector replaces the sigmoid term

 y_k is a column vector whose i-th entry is 1 if observation i is in class k and 0 otherwise

SLIDE 27

Softmax Gradient Descent

  • Algorithm largely unchanged
  • 1. Initialize learning rate
  • 2. Randomly choose B1…Bm
  • 3. B_k ← B_k + (η/N) X^T (y_k - π_k) (see the sketch below)
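A minimal sketch of softmax gradient descent as outlined above (NumPy; Y_onehot is my own name for the N x n indicator matrix whose k-th column plays the role of y_k, and P collects the class probabilities π_k, so all classes are updated in one matrix step):

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax of a score matrix S (N x n)."""
    e = np.exp(S - S.max(axis=1, keepdims=True))   # stabilize
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, Y_onehot, eta=0.1, max_iters=5000, seed=0):
    """B_k <- B_k + (eta/N) X^T (y_k - pi_k) for every class k at once."""
    N, d = X.shape
    n_classes = Y_onehot.shape[1]
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=0.01, size=(d, n_classes))   # one coefficient column per class
    for _ in range(max_iters):
        P = softmax_rows(X @ B)                       # N x n class probabilities
        B = B + (eta / N) * X.T @ (Y_onehot - P)
    return B
```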

SLIDE 28

Last Notes

  • Learning Rate η:

 Small values take a long time to converge

 Large values may prevent convergence

 Important to monitor the loss function on each iteration

  • Also need to make sure you normalize so values don't get too large (see the standardization sketch after this list)

  • Gradient descent algorithms across all of these models are very similar
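A minimal sketch of the two housekeeping points above (NumPy). Standardizing the feature columns and printing the loss periodically are my own illustration of "normalize" and "monitor the loss"; the function names and constants are illustrative.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance columns so feature values don't get too large."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_monitored(X, y, eta=0.1, max_iters=2000):
    X = standardize(X)
    N, d = X.shape
    B = np.zeros(d)
    for it in range(max_iters):
        p = sigmoid(X @ B)
        if it % 200 == 0:   # monitor the loss as training proceeds
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            print(f"iter {it}: loss {loss:.4f}")
        B = B + (eta / N) * X.T @ (y - p)
    return B
```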

SLIDE 29

Summary

  • Math Behind Logistic Regression
  • Visualizing Logistic Regression
  • Loss Function

 Minimizing Log Likelihood Function

  • Batch/Full Logistic Regression
  • Gradient Descent for OLS
  • Gradient Descent for Logistic Regression
  • Comparing OLS and Logistic Regression
  • Multi-Class Logistic Regression Using Softmax