SLIDE 1

CSCI 447/547 MACHINE LEARNING

Logistic Regression

SLIDE 2

Outline

  • Math Behind Logistic Regression
  • Visualizing Logistic Regression
  • Loss Function

 Minimizing Log Likelihood Function

  • Batch/Full Logistic Regression
  • Gradient Descent for OLS
  • Gradient Descent for Logistic Regression
  • Comparing OLS and Logistic Regression
  • Multi-Class Logistic Regression Using Softmax
SLIDE 3

OLS Recap

  • Linear regression

 Predicts continuous and potentially unbounded labels based on given features

 Y = XB

 B are the coefficients, X is the data matrix

 The Issue:

 Unbounded output means we cannot use it for discrete classification (see the fitting sketch below)
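As an illustration of the OLS setup above, here is a minimal NumPy sketch (synthetic data and coefficient values made up for illustration) that fits B in Y = XB by least squares. Note that the resulting predictions are continuous and unbounded, which is exactly the issue for classification.

```python
import numpy as np

# Hypothetical data: N observations, d features (values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Least-squares solution for B in Y = XB.
B, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ B   # predictions are continuous and unbounded
```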

SLIDE 4

Logistic Regression

  • Classification

 Predicts discrete labels based on given features

  • The Setup

 y = 0 or 1 with probabilities 1-p and p

 Predict a probability instead of a value

 Estimate P(y=1|X)

SLIDE 5

Definitions

  • Link Function

 Relates mean of distribution to output of linear model

 Converts unbounded to bounded predictions

 Converts continuous output to a discrete interpretation

 Typically based on an exponential-family distribution

  • Logistic Function

 Input is -∞ to ∞ and output is 0 to 1

 f: (-∞, ∞) -> (0, 1)

 Typically the sigmoid function

 f(0) = 0.5, f(-∞) = 0, f(∞) = 1

 This value is a probability (see the sigmoid sketch below)
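A minimal sketch of the sigmoid link function described above (NumPy; the function name sigmoid is my own choice), confirming the boundary values f(0) = 0.5, f(-∞) → 0, f(∞) → 1:

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) link: maps (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))      # 0.5
print(sigmoid(-50.0))    # ~0.0
print(sigmoid(50.0))     # ~1.0
```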

SLIDE 6

Note on Choice of Link Function

  • Properties:

 Bounded [0, 1]

 Domain (-∞, ∞)

 Differentiable everywhere

 Used in optimization

 Increasing function

 The sign of each coefficient's effect (positive or negative) is preserved in moving from linear to logistic regression

SLIDE 7

Visualizing Logistic Regression

SLIDE 8

Logistic Regression Loss

  • Our Goal
  • The Full Log-Likelihood
  • A Note on Minimization
SLIDE 9

Our Goal

  • Find P(y=1 | X)

 Probability to be estimated

  • Logistic Function: Φ(t) = 1 / (1 + e^(-t))
  • OLS Function Y = XB becomes Y = Φ(XB)
  • Φ = (σ₁, …, σ_N)
  • y_i = σ(x_i · B)
  • MLE – Maximum Likelihood Estimation
  • L = ∏_{i=1}^{N} P(Y = y_i | x_i)
  • Independent, so the joint probability is the product over the observations
SLIDE 10

Our Goal

  • L = ∏_{i=1}^{N} P(Y = y_i | x_i)
  • = ∏_{i=1}^{N} p_i^(y_i) (1 - p_i)^(1 - y_i)
  • Log-Likelihood, log(L)
  • log(L) = Σ_{i=1}^{N} [y_i log(p_i) + (1 - y_i) log(1 - p_i)]
  • = Σ_{i=1}^{N} [y_i log σ(x_i · B) + (1 - y_i) log(1 - σ(x_i · B))]
  • Since we want to minimize, take the negative log likelihood
  • -log(L) – will minimize this (see the sketch below)
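A minimal sketch of the negative log-likelihood written out above (NumPy; the helper name neg_log_likelihood is my own, assuming a data matrix X, 0/1 labels y, and coefficients B):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_likelihood(B, X, y):
    """-log(L) = -sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ], with p_i = sigmoid(x_i . B)."""
    p = sigmoid(X @ B)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```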
SLIDE 11

The Full Log-Likelihood

  • Minimize this in logistic regression
  • So far, similar to OLS

 But this does not have an explicit (closed-form) solution, so we need to minimize it numerically

SLIDE 12

Logistic Regression Loss

SLIDE 13

Gradient Descent

SLIDE 14

Batch/Full Gradient Descent

  • The Gradient
  • Algorithm

 1. Choose x randomly

 2. Compute the gradient of f at x: ∇f(x)

 3. Step in the direction of the negative of the gradient

 x <- x - η ∇f(x)

 η = step size (too large can overshoot; too small takes too long)

 Repeat steps 2 and 3 until convergence

 Stop when the change between iterations is no longer decreasing or a set number of iterations has been reached (see the sketch below)
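A minimal sketch of the batch gradient descent loop above (Python; the names grad_f, x0, eta, and the tolerance are my own illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, max_iters=1000, tol=1e-8):
    """Generic batch gradient descent: x <- x - eta * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # change between iterations is tiny
            break
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is 0.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
```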

SLIDE 15

Gradient Descent for OLS

  • Numerical Minimization of OLS

 Mean Log-Likelihood

 L = -(1/N) ||y - XB||₂²

  • The Algorithm for OLS (a minimal sketch follows below)
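A minimal sketch of gradient descent for OLS (NumPy; the factor 2/N comes from differentiating the mean squared norm above, and the function name, learning rate, and iteration count are my own illustrative choices):

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, max_iters=5000):
    """Minimize (1/N)||y - XB||^2 by batch gradient descent."""
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(max_iters):
        B = B + (2 * eta / N) * X.T @ (y - X @ B)   # gradient step
    return B
```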
SLIDE 16

Gradient Descent for Logistic Regression

  • Numerical Minimization of Logistic Loss
  • L(B) = -Σ_i [y_i log(Φ(x_i · B)) + (1 - y_i) log(1 - Φ(x_i · B))]

 Two Components

 Recall Φ(t) = 1 / (1 + e^(-t))

 Because this is symmetric about 1/2:

 Φ(t) + Φ(-t) = 1

SLIDE 17

Gradient Descent for Logistic Regression

  • First Term
  • ∂/∂B_k [log(Φ(x_i · B))] = (1/σ(x_i · B)) ∂/∂B_k σ(x_i · B)
  • = (1 + e^(-x_i · B)) ∂/∂B_k [1 / (1 + e^(-x_i · B))]
  • = (1 + e^(-x_i · B)) [1 / (1 + e^(-x_i · B))]² e^(-x_i · B) x_ik
  • = [1 / (1 + e^(x_i · B))] x_ik
  • = σ(-x_i · B) x_ik = (1 - σ(x_i · B)) x_ik
  • In vector form: ∂/∂B [log(Φ(x_i · B))] = (1 - σ(x_i · B)) x_i^T

SLIDE 18

Gradient Descent for Logistic Regression

  • Second Term
  • ∂/∂B [log(1 - Φ(x_i · B))] = ∂/∂B [log(σ(-x_i · B))]
  • = -x_i^T (1 - σ(-x_i · B))   (the first-term result, with -x_i in place of x_i)
  • = -x_i^T σ(x_i · B)

SLIDE 19

Gradient Descent for Logistic Regression

  • All Terms
  • ∂L(B)/∂B = -Σ_i [y_i x_i^T (1 - σ(x_i · B)) + (1 - y_i)(-x_i^T) σ(x_i · B)]
  • = -Σ_i [y_i x_i^T - y_i x_i^T σ(x_i · B) + y_i x_i^T σ(x_i · B) - x_i^T σ(x_i · B)]
  • = -Σ_i x_i^T (y_i - σ(x_i · B))
  • = -X^T (y - Φ(XB))
  • where Φ = (σ₁, …, σ_N)

SLIDE 20

Gradient Descent for Logistic Regression

  • ∂L/∂B = -X^T (y - Φ(XB))

 where Φ applies the individual logistic functions

  • Normalize this by the number of observations

 Divide by N to get the mean loss (see the gradient sketch below)
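A minimal sketch of the mean gradient derived above, -(1/N) X^T (y - Φ(XB)) (NumPy; the function name logistic_gradient is my own):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_gradient(B, X, y):
    """Mean gradient of the negative log-likelihood: -(1/N) X^T (y - sigmoid(XB))."""
    N = X.shape[0]
    return -X.T @ (y - sigmoid(X @ B)) / N
```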

SLIDE 21

Gradient Descent for Logistic Regression

  • Algorithm:

 Pick learning rate η

 Initialize B randomly

 Iterate:

 B ← B - η ∂L/∂B = B + (η/N) X^T [y - Φ(XB)] (see the training-loop sketch below)
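Putting the pieces together, a minimal sketch of the full batch logistic regression training loop (NumPy; the function name, learning rate, iteration count, and random initialization scale are my own illustrative choices):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, max_iters=10000, seed=0):
    """Batch gradient descent: B <- B + (eta/N) X^T (y - sigmoid(XB))."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=0.01, size=d)        # initialize B randomly
    for _ in range(max_iters):
        B = B + (eta / N) * X.T @ (y - sigmoid(X @ B))
    return B

# Predicted probabilities P(y=1|x) for new data X_new would be sigmoid(X_new @ B).
```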

SLIDE 22

Comparing OLS and Logistic Regression

  • The Two Update Steps
  • OLS:

 B ← B + (2η/N) X^T (y - XB)

 constant residual error

  • Logistic:

 B ← B + (η/N) X^T (y - Φ(XB))

 constant error

SLIDE 23

Comparing OLS and Logistic Regression

  • Regularization
  • OLS Loss Function under Regularization:

 L = ||y - XB||₂² + λ²||B||₂²

  • Logistic:

 L = ||y - Φ(XB)||₂² + λ²||B||₂²

  • L2 Norm – Ridge Regression: Uniform Regularization
  • L1 Norm – LASSO Regression: Dimensionality Reduction (see the ridge sketch below)
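A minimal sketch of how an L2 (ridge) penalty changes the gradient step (NumPy). Note this is my own illustration: it adds the penalty's gradient 2λ²B to the log-likelihood gradient from the earlier slides rather than differentiating the squared-error form shown above, and lam stands in for λ.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_ridge(X, y, lam=0.1, eta=0.1, max_iters=10000):
    """Gradient descent with an L2 (ridge) penalty lam^2 * ||B||^2 added to the loss."""
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(max_iters):
        grad = -X.T @ (y - sigmoid(X @ B)) / N + 2 * lam**2 * B   # penalty adds 2*lam^2*B
        B = B - eta * grad
    return B
```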

SLIDE 24

Multi-Class Logistic Regression Using Softmax

  • Softmax

 P(y = k | x_i) = e^(B_k · x_i) / Σ_{l=1}^{n} e^(B_l · x_i),  y ∈ {1, …, n}

 Σ_{k=1}^{n} P(y = k | x_i) = 1, so it is a probability (see the sketch below)
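A minimal sketch of the softmax above (NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(scores):
    """softmax(s)_k = exp(s_k) / sum_l exp(s_l); the outputs sum to 1."""
    shifted = scores - np.max(scores)        # stability: avoid overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # probabilities summing to 1
```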

SLIDE 25

Multi-Class Logistic Regression Using Softmax

  • Comparison with Logistic Regression in the Case of Two Classes

 2 Classes: P(y = 1 | x_i) = 1 / (1 + e^(-x_i · B))

 Softmax with 2 classes:

 P(y = 1 | x_i) = e^(x_i · B_1) / (e^(x_i · B_1) + e^(x_i · B_2))

 Multiply numerator and denominator by e^(-x_i · B_1):

 = 1 / (1 + e^(-x_i · (B_1 - B_2))) (see the numerical check below)
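A quick numerical check of the two-class equivalence above (NumPy; the feature vector x and the two coefficient vectors B1, B2 are made-up illustrative values):

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])
B1 = np.array([0.3, 0.8, -0.1])
B2 = np.array([-0.4, 0.2, 0.5])

# Softmax probability of class 1 with two classes
scores = np.array([x @ B1, x @ B2])
p_softmax = np.exp(scores[0]) / np.exp(scores).sum()

# Logistic form with coefficients B1 - B2
p_logistic = 1.0 / (1.0 + np.exp(-x @ (B1 - B2)))

print(np.isclose(p_softmax, p_logistic))     # True
```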

SLIDE 26

Softmax Optimization

  • Classification Probabilities:

 Π_{i,k} = e^(x_i · B_k) / Σ_l e^(x_i · B_l),  π_k = (Π_{1,k}, …, Π_{N,k})

 Probability of observation i belonging to class k

  • The Gradients:

 ∂L/∂B_k = -X^T (y_k - π_k)

 Logistic: -X^T (y - σ(XB))

 Probability vector replaces the sigmoid term

 y_k is a column vector whose i-th entry is 1 if observation i is in class k and 0 otherwise

SLIDE 27

Softmax Gradient Descent

  • Algorithm largely unchanged
  • 1. Initialize learning rate
  • 2. Randomly choose B1…Bm
  • 3. B_k ← B_k + (η/N) X^T (y_k - π_k) (see the sketch below)
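A minimal sketch of softmax gradient descent as outlined above (NumPy; Y_onehot is my own name for the N x n indicator matrix whose k-th column plays the role of y_k, and P collects the class probabilities π_k, so all classes are updated in one matrix step):

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax of a score matrix S (N x n)."""
    e = np.exp(S - S.max(axis=1, keepdims=True))   # stabilize
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, Y_onehot, eta=0.1, max_iters=5000, seed=0):
    """B_k <- B_k + (eta/N) X^T (y_k - pi_k) for every class k at once."""
    N, d = X.shape
    n_classes = Y_onehot.shape[1]
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=0.01, size=(d, n_classes))   # one coefficient column per class
    for _ in range(max_iters):
        P = softmax_rows(X @ B)                       # N x n class probabilities
        B = B + (eta / N) * X.T @ (Y_onehot - P)
    return B
```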

SLIDE 28

Last Notes

  • Learning Rate η:

 Small values take a long time to converge

 Large values may prevent convergence

 Important to monitor the loss function on each iteration

  • Also need to make sure you normalize so values don't get too large (see the standardization sketch after this list)

  • Gradient descent algorithms across all of these models are very similar
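A minimal sketch of the two housekeeping points above (NumPy). Standardizing the feature columns and printing the loss periodically are my own illustration of "normalize" and "monitor the loss"; the function names and constants are illustrative.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance columns so feature values don't get too large."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_monitored(X, y, eta=0.1, max_iters=2000):
    X = standardize(X)
    N, d = X.shape
    B = np.zeros(d)
    for it in range(max_iters):
        p = sigmoid(X @ B)
        if it % 200 == 0:   # monitor the loss as training proceeds
            loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            print(f"iter {it}: loss {loss:.4f}")
        B = B + (eta / N) * X.T @ (y - p)
    return B
```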

SLIDE 29

Summary

  • Math Behind Logistic Regression
  • Visualizing Logistic Regression
  • Loss Function

 Minimizing Log Likelihood Function

  • Batch/Full Logistic Regression
  • Gradient Descent for OLS
  • Gradient Descent for Logistic Regression
  • Comparing OLS and Logistic Regression
  • Multi-Class Logistic Regression Using Softmax