
ECE 6254 - Spring 2020 - Lecture 8 v1.0 - revised January 30, 2020

Logistic Regression, Gradient Descent, and Newton Method

Matthieu R. Bloch

1 Maximum Likelihood Estimator (MLE) for logistic classification

We will start with a standard trick to simplify notation, which consists in defining x̃ = [1, x⊺]⊺ and θ = [b, w⊺]⊺. This allows us to write the logistic model as

\[
\eta(x) \triangleq \eta_1(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})}. \tag{1}
\]

To avoid carrying a tilde repeatedly in our notation, we will now simply write x in place of x̃, but keep in mind that we operate under the assumption that the first component of x is set to one. Given our dataset {(xᵢ, yᵢ)} for i = 1, …, N, the likelihood is L(θ) ≜ ∏ᵢ Pθ(yᵢ|xᵢ), where we don't try to model the distribution of xᵢ, as mentioned in Example ??. For K = 2 and Y = {0, 1}, we obtain

\[
L(\theta) \triangleq \prod_{i=1}^{N} \eta(x_i)^{y_i} \left(1 - \eta(x_i)\right)^{1 - y_i}. \tag{2}
\]

In case you are not familiar with this way of writing the likelihood, note that

\[
\eta(x_i)^{y_i} \left(1 - \eta(x_i)\right)^{1 - y_i} =
\begin{cases}
\eta(x_i) = \eta_1(x_i) & \text{if } y_i = 1 \\
1 - \eta(x_i) = \eta_0(x_i) & \text{if } y_i = 0.
\end{cases} \tag{3}
\]

The log-likelihood can therefore be written as

\[
\ell(\theta) \triangleq \log L(\theta) = \sum_{i=1}^{N} \left( y_i \log \eta(x_i) + (1 - y_i) \log\left(1 - \eta(x_i)\right) \right) \tag{4}
\]
\[
= \sum_{i=1}^{N} \left( y_i \log \frac{1}{1 + e^{-\theta^\top x_i}} + (1 - y_i) \log \frac{e^{-\theta^\top x_i}}{1 + e^{-\theta^\top x_i}} \right) \tag{5}
\]
\[
= \sum_{i=1}^{N} \left( y_i \theta^\top x_i - \log\left(1 + e^{\theta^\top x_i}\right) \right). \tag{6}
\]

To find the maximizer of ℓ with respect to (w.r.t.) θ, equivalently the minimizer of −ℓ, a necessary condition for optimality is ∇θℓ(θ) = 0. Here, this means that

\[
\nabla_\theta \ell(\theta) = \sum_{i=1}^{N} \left( y_i x_i - \frac{e^{\theta^\top x_i}}{1 + e^{\theta^\top x_i}} x_i \right) = \sum_{i=1}^{N} x_i \left( y_i - \frac{1}{1 + e^{-\theta^\top x_i}} \right) = 0. \tag{7}
\]

Solving this equation means solving a nonlinear system of d + 1 equations, for which there exists no closed-form solution. Hence, we must resort to a numerical algorithm to find the solution of argminθ −ℓ(θ).
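The derivation above can be checked numerically. The sketch below builds the log-likelihood (6) and its gradient (7) on toy data and verifies the gradient against finite differences; all names (`eta`, `log_likelihood`, the random data) are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
# First component of each x_i set to one, as assumed in the notes.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.integers(0, 2, size=N).astype(float)
theta = rng.normal(size=d + 1)

def eta(theta):
    """Logistic model (1): 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def log_likelihood(theta):
    """Log-likelihood (6): sum_i y_i theta^T x_i - log(1 + exp(theta^T x_i))."""
    z = X @ theta
    return np.sum(y * z - np.log1p(np.exp(z)))

def grad_log_likelihood(theta):
    """Gradient (7): sum_i x_i (y_i - eta(x_i))."""
    return X.T @ (y - eta(theta))

# Central finite-difference approximation of the gradient, one coordinate at a time.
eps = 1e-6
g_fd = np.array([
    (log_likelihood(theta + eps * e) - log_likelihood(theta - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
assert np.allclose(grad_log_likelihood(theta), g_fd, atol=1e-5)
```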


You should check for yourself that −ℓ(θ) is convex in θ, and there exist algorithms with provable convergence guarantees. We will mention a few specific techniques, such as gradient descent and Newton's method, but there are many more that are especially useful in high dimension.

2 Conclusion regarding plug-in methods

Naive Bayes, Linear Discriminant Analysis (LDA), and logistic classification are all plug-in methods that result in linear classifiers, i.e., classifiers for which the decision boundaries are hyperplanes in R^d. All have advantages and drawbacks:

  • Naive Bayes is a plug-in method based on a seldom-valid assumption (independence of the features given the class), but it scales well to high dimensions and naturally handles mixtures of discrete and continuous features;

  • LDA tends to work well if the assumption regarding the Gaussian distribution of the feature

vectors in a class is valid;

  • Logistic classification models only the conditional distribution Py|x, not the joint distribution Py,x, which is valid for a larger class of distributions and results in fewer parameters to estimate.

Plug-in methods can be useful in practice, but ultimately they are very limited. There are always distributions for which the assumptions are violated, and if our assumptions are wrong, the output is totally unpredictable. It can be hard to verify whether our assumptions are right, and plug-in methods often require solving a more difficult problem as an intermediate step; see for instance the detour made by LDA to obtain a linear model.

3 Gradient descent and Newton's method

Assume that we wish to solve the problem min over x ∈ R^d of f(x), where f : R^d → R. Very often one cannot obtain a closed-form expression for the solution, and one resorts to numerical algorithms to obtain it. We could spend an entire semester studying these algorithms in depth (and why they work); here we will only briefly review the important concepts.

Gradient descent. The idea of gradient descent is to find the minimum of f iteratively by following the direction opposite to the gradient ∇f. Intuitively, gradient descent consists in "rolling down the hill" to find the minimum. A typical gradient descent algorithm would run as follows.

  • Start with a guess of the solution x_0.
  • For j ⩾ 0, update the estimate as x_{j+1} = x_j − η∇f(x_j), where η > 0 is called the stepsize.

Without further assumptions, there is no guarantee of convergence. In addition, the choice of stepsize η really matters: too small and convergence takes forever, too big and the algorithm might never converge. Very often in machine learning, the function f to optimize is a loss function ℓ(θ) that takes the form

\[
\ell(\theta) \triangleq \sum_{i=1}^{N} \ell_i(\theta), \tag{8}
\]


where ℓi(θ) is a function of θ and the data point (xi, yi) but not of the other data points. When the number of data points N is very large, or when the data points cannot all be accessed at the same time, a typical approach is not to compute the exact gradient ∇ℓ to perform updates, but only to evaluate a single ∇ℓi. In its simplest form, stochastic gradient descent consists of the following rule.

  • Start with a guess of the solution θ^(0).
  • For j ⩾ 0, update the estimate as θ^(j+1) = θ^(j) − η∇ℓ_j(θ^(j)), where η > 0 is the stepsize.¹
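As a rough sketch, the stochastic rule above can be applied to the logistic per-sample losses ℓᵢ(θ) = −(yᵢθ⊺xᵢ − log(1 + exp(θ⊺xᵢ))) from Section 1. The toy data, stepsize, and iteration count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
true_theta = np.array([0.5, 2.0, -1.0])  # hypothetical ground truth for the toy data
y = (rng.random(N) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

def grad_l_i(theta, i):
    """Gradient of the per-sample loss l_i(theta): -(y_i - eta(x_i)) x_i."""
    p = 1 / (1 + np.exp(-X[i] @ theta))
    return -(y[i] - p) * X[i]

def full_loss(theta):
    """Full loss -l(theta), used only to monitor progress."""
    z = X @ theta
    return np.sum(np.log1p(np.exp(z)) - y * z)

theta = np.zeros(d + 1)
eta = 0.05
for j in range(20000):
    i = rng.integers(N)                    # data index, chosen independently of j
    theta = theta - eta * grad_l_i(theta, i)

# With a constant stepsize the iterates hover near the optimum rather than
# converging exactly, but the full loss still drops well below its initial value.
assert full_loss(theta) < full_loss(np.zeros(d + 1))
```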

Note that the update only depends on the loss function evaluated at a single point (x_j, y_j).

Newton's method. One drawback of the basic gradient descent sketched earlier is the presence of a parameter η, which has to be set a priori. There exist many methods to choose η adaptively, and Newton's method is one of them. Specifically, consider a quadratic approximation f̃ of the function f at a point x:

\[
\tilde{f}(x') \triangleq f(x) + \nabla f(x)^\top (x' - x) + \frac{1}{2} (x' - x)^\top \nabla^2 f(x) (x' - x). \tag{9}
\]

The matrix ∇²f(x) ∈ R^{d×d} is called the Hessian of f and is defined as

\[
\nabla^2 f(x) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2}
\end{bmatrix}. \tag{10}
\]

Note that the gradient of f̃ is ∇f̃(x') = ∇f(x) + ∇²f(x)x' − ∇²f(x)x. Let us now assume that we decide to choose the next point x_{j+1} of the descent as the minimum of the quadratic approximation of f at x_j. Because f̃ is quadratic, we can find that minimum by solving ∇f̃(x_{j+1}) = 0:

\[
\nabla \tilde{f}(x_{j+1}) = 0 \iff \nabla f(x_j) + \nabla^2 f(x_j)\, x_{j+1} - \nabla^2 f(x_j)\, x_j = 0 \tag{11}
\]
\[
\iff x_{j+1} = x_j - \left[ \nabla^2 f(x_j) \right]^{-1} \nabla f(x_j). \tag{12}
\]

This looks like the gradient descent update equation, except that we have chosen the stepsize to be the matrix [∇²f(x_j)]⁻¹; this adjusts how much we move as a function of the local curvature. Newton's method has much faster convergence than gradient descent, but requires the calculation of the inverse of the Hessian. This is feasible when the dimension d is small but impractical when d is large. In many machine learning problems, researchers therefore focus on gradient descent techniques that attempt to adapt the stepsize without having to compute a full Hessian.
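As a sketch of update (12) on a concrete problem, the code below minimizes −ℓ(θ) from Section 1 with Newton's method; for that loss a short computation gives the Hessian ∇²(−ℓ)(θ) = Σᵢ η(xᵢ)(1 − η(xᵢ)) xᵢxᵢ⊺. The toy data and iteration count are illustrative assumptions, and the linear system is solved directly rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
true_theta = np.array([0.3, 1.5, -0.7])  # hypothetical ground truth for the toy data
y = (rng.random(N) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(d + 1)
for _ in range(20):
    p = 1 / (1 + np.exp(-X @ theta))             # eta(x_i) for every i
    grad = -X.T @ (y - p)                        # gradient of -l, cf. (7)
    hess = X.T @ (X * (p * (1 - p))[:, None])    # Hessian of -l: sum_i p_i(1-p_i) x_i x_i^T
    theta = theta - np.linalg.solve(hess, grad)  # Newton update (12), via solve not inverse

# At the MLE the optimality condition (7) holds to high precision.
p = 1 / (1 + np.exp(-X @ theta))
assert np.allclose(X.T @ (y - p), 0.0, atol=1e-8)
```

Using `np.linalg.solve` instead of explicitly inverting the Hessian is the standard numerical choice: it costs the same order of operations but is more accurate, and it still illustrates why the method becomes impractical when d is large.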

¹The data index may be chosen to be different from the iteration step index.
