SLIDE 1

15-780: Optimization

J. Zico Kolter

March 14-16, 2015

SLIDE 2

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 3

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 4

Beyond linear programming

Linear programming:

    minimize_x   cᵀx
    subject to   Gx ≤ h
                 Ax = b

General (continuous) optimization:

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 hᵢ(x) = 0,  i = 1, …, p

where x ∈ Rⁿ is the optimization variable, f : Rⁿ → R is the objective function, gᵢ : Rⁿ → R are the inequality constraints, and hᵢ : Rⁿ → R are the equality constraints

SLIDE 6

Example: image deblurring

[Figure: original image, blurred image, and reconstruction, from (Wang et al., 2009)]

Given a corrupted m × n image represented as a vector y ∈ R^{m·n}, find x ∈ R^{m·n} by solving the optimization problem

    minimize_x   ‖K∗x − y‖₂² + λ ( Σ_{i=1}^{n−1} |x_{mi} − x_{m(i+1)}| + Σ_{i=1}^{m−1} |x_{ni} − x_{n(i+1)}| )

where K∗ denotes 2D convolution with some filter K

SLIDE 7

Example: machine learning

Virtually all machine learning algorithms can be expressed as minimizing a loss function over observed data. Given inputs x⁽ⁱ⁾ ∈ X, desired outputs y⁽ⁱ⁾ ∈ Y, a hypothesis function h_θ : X → Y defined by parameters θ ∈ Rⁿ, and a loss function ℓ : Y × Y → R₊, machine learning algorithms solve the optimization problem

    minimize_θ   Σ_{i=1}^m ℓ( h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾ )
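
As a concrete illustration (my own sketch, not from the slides): linear regression is this template with h_θ(x) = θᵀx and squared loss, in which case the minimizer even has a closed form. The toy data below is an assumption for demonstration.

import numpy as np

# Hypothesis h_theta(x) = theta^T x and loss l(yhat, y) = (yhat - y)^2
def total_loss(theta, X, y):
    # Sum of per-example losses, matching the objective on this slide
    return np.sum((X @ theta - y) ** 2)

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # inputs x^(i) in R^3
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)    # outputs y^(i)

# For this particular hypothesis/loss, the minimizing theta solves a least-squares problem
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_hat, total_loss(theta_hat, X, y))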

SLIDE 8

Example: robot trajectory planning

[Figure from (Schulman et al., 2013)]

With robot state xₜ and control inputs uₜ:

    minimize_{x_{1:T}, u_{1:T−1}}   Σ_{t=1}^{T−1} ( ‖xₜ − xₜ₊₁‖₂² + ‖uₜ‖₂² )
    subject to   xₜ₊₁ = f_dynamics(xₜ, uₜ)    (robot dynamics)
                 f_collision(xₜ) ≥ 0.1        (avoid collisions)
                 x₁ = x_init,  x_T = x_goal

SLIDE 9

Many other applications

We’ve already seen many applications (e.g., any linear programming setting is also an example of continuous optimization), but there are many other nonlinear problems. There are applications in control, machine learning, finance, forecasting, signal processing, communications, structural design, and many others. The move to optimization-based formalisms has been one of the primary trends in AI over the past 15+ years.

SLIDE 10

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 11

Classes of optimization problems

[Diagram: unconstrained vs. constrained, convex vs. nonconvex, smooth vs. nonsmooth]

There are many different classifications for (continuous) optimization problems (linear programming, nonlinear programming, quadratic programming, semidefinite programming, second order cone programming, geometric programming, etc.), which can get overwhelming. We focus on three distinctions: unconstrained vs. constrained, convex vs. nonconvex, and (less so) smooth vs. nonsmooth.

SLIDE 12

Unconstrained vs. constrained optimization

    minimize_x   f(x)

vs.

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 hᵢ(x) = 0,  i = 1, …, p

In unconstrained optimization, every point x ∈ Rⁿ is “feasible,” so the singular focus is on finding a low value of f(x). In constrained optimization (where constraints truly need to hold exactly), it may be difficult to find an initial feasible point and to maintain feasibility during optimization. This typically leads to different classes of algorithms.

SLIDE 13

Convex vs. nonconvex optimization

Originally researchers distinguished between linear (easy) and nonlinear (hard) optimization problems. But in the 80s and 90s, it became clear that this wasn’t the right line: the real distinction is between convex (easy) and nonconvex (hard) problems.

The optimization problem

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 hᵢ(x) = 0,  i = 1, …, p

is convex if f and the gᵢ’s are all convex functions and the hᵢ’s are affine functions

SLIDE 14

Convex functions

[Figure: a convex function with the chord between (x, f(x)) and (y, f(y)) lying above it]

A function f : Rⁿ → R is convex if, for any x, y ∈ Rⁿ and θ ∈ [0, 1],

    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

f is concave if −f is convex. f is affine if it is both convex and concave; an affine function must take the form

    f(x) = aᵀx + b

for a ∈ Rⁿ, b ∈ R

SLIDE 15

Why is convex optimization easy?

[Figure: a convex function f₁(x) and a nonconvex function f₂(x)]

Convex functions “curve upward everywhere,” and convex constraints define a convex set (for any feasible x, y, the point θx + (1 − θ)y is also feasible for θ ∈ [0, 1]). Together, these properties imply that any local optimum must also be a global optimum. Thus, for convex problems we can use local methods to find the globally optimal solution (cf. linear programming vs. integer programming).

SLIDE 16

Smooth vs. Nonsmooth optimization

[Figure: a smooth function f₁(x) and a nonsmooth function f₂(x)]

In optimization, we care about smoothness in terms of whether functions are (first or second order) continuously differentiable. A function f is first order continuously differentiable if its derivative f′ exists and is continuous; the Lipschitz constant of its derivative is a constant L such that for all x, y

    |f′(x) − f′(y)| ≤ L|x − y|

In the next section we will use first and second derivative information to optimize functions, so whether or not these exist affects which methods we can apply.

SLIDE 17

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 18

Solving optimization problems

Start with the unconstrained, smooth, one-dimensional case. [Figure: a one-dimensional function f(x) with minimum x⋆]

To find the minimum point x⋆, we can look at the derivative of the function, f′(x): any location where f′(x) = 0 will be a “flat” point of the function. For convex problems, this is guaranteed to be a minimum (instead of a maximum).

SLIDE 19

The gradient

[Figure: contour plot over x₁, x₂ with the gradient ∇ₓf(x)]

For a multivariate function f : Rⁿ → R, its gradient is an n-dimensional vector containing the partial derivatives with respect to each dimension

    ∇ₓf(x) = [ ∂f(x)/∂x₁, …, ∂f(x)/∂xₙ ]ᵀ

For continuously differentiable f and unconstrained optimization, the optimal point must have ∇ₓf(x⋆) = 0

SLIDE 20

Properties of the gradient

[Figure: f(x) and its first order Taylor approximation f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀) at x₀]

The gradient defines the first order Taylor approximation to the function f around a point x₀

    f(x) ≈ f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀)

For convex f, the first order Taylor approximation is always an underestimate

    f(x) ≥ f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀)

SLIDE 21

Some common gradients

For f(x) = aᵀx, the gradient is given by ∇ₓf(x) = a, since

    ∂f(x)/∂xᵢ = ∂/∂xᵢ ( Σ_{j=1}^n aⱼxⱼ ) = aᵢ

For f(x) = ½ xᵀQx, the gradient is given by ∇ₓf(x) = ½ (Q + Qᵀ)x, or just ∇ₓf(x) = Qx if Q is symmetric (Q = Qᵀ)
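
As a quick sanity check (my own sketch, not from the slides), the snippet below verifies both formulas numerically with central finite differences; a, Q, and x are arbitrary test data.

import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # Central finite-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a, x = rng.normal(size=4), rng.normal(size=4)
Q = rng.normal(size=(4, 4))

# f(x) = a^T x  ->  gradient a
print(np.allclose(numerical_grad(lambda z: a @ z, x), a))
# f(x) = 0.5 x^T Q x  ->  gradient 0.5 (Q + Q^T) x
print(np.allclose(numerical_grad(lambda z: 0.5 * z @ Q @ z, x), 0.5 * (Q + Q.T) @ x))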

SLIDE 22

How do we find ∇xf (x) = 0?

Direct solution: in some cases, it is possible to analytically compute the x⋆ such that ∇ₓf(x⋆) = 0. Example:

    f(x) = 2x₁² + x₂² + x₁x₂ − 6x₁ − 5x₂

    ⇒ ∇ₓf(x) = [ 4x₁ + x₂ − 6,  x₁ + 2x₂ − 5 ]ᵀ

    ⇒ x⋆ = [ 4 1 ; 1 2 ]⁻¹ [ 6 ; 5 ] = [ 1 ; 2 ]

Iterative methods: more commonly, the condition that the gradient equal zero will not have an analytical solution, requiring iterative methods
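
A minimal numpy check of this example (my own sketch, not from the slides): setting the gradient to zero gives a 2 × 2 linear system, and solving it recovers x⋆ = (1, 2).

import numpy as np

# f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2, so grad f(x) = Q x - b with:
Q = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([6.0, 5.0])

x_star = np.linalg.solve(Q, b)   # solve grad f(x) = 0
print(x_star)                    # -> [1. 2.]
print(Q @ x_star - b)            # gradient at x_star is zero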

SLIDE 23

Gradient descent

The gradient doesn’t just give us the optimality condition; it also points in the direction of “steepest ascent” for the function f. [Figure: contour plot over x₁, x₂ with the gradient ∇ₓf(x)]

This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient:

    Repeat:  x ← x − α ∇ₓf(x)

for some step size α > 0
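
A minimal numpy sketch of this update (my own illustration): the objective is the example quadratic f(x) = 2x₁² + x₂² + x₁x₂ − 6x₁ − 5x₂ used on the surrounding slides, and the fixed step size is an arbitrary choice.

import numpy as np

def grad_f(x):
    # Gradient of f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2
    return np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])

x = np.zeros(2)
alpha = 0.2                      # fixed step size (illustrative choice)
for _ in range(100):
    x = x - alpha * grad_f(x)    # Repeat: x <- x - alpha * grad f(x)
print(x)                         # approaches the optimum [1, 2]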

SLIDE 24

[Figure: iterates on the contour plot and f − f⋆ vs. iteration]

100 iterations of gradient descent on the function

    f(x) = 2x₁² + x₂² + x₁x₂ − 6x₁ − 5x₂

SLIDE 25

How do we choose step size α?

The choice of α plays a big role in the convergence of the algorithm.

[Figure: gradient descent iterates on the contour plot for α = 0.05 and α = 0.42]

SLIDE 26

[Figure: f − f⋆ vs. iteration for α = 0.2, α = 0.42, and α = 0.05]

Convergence of gradient descent for different step sizes

SLIDE 27

If we know the gradient is Lipschitz continuous with constant L, the step size α = 1/L is good in theory and practice. But what if we don’t know the Lipschitz constant, or the derivative has an unbounded Lipschitz constant?

Idea #1 (“exact” line search): choose α to minimize f(x₀ − α∇f(x₀)) for the current iterate x₀; this is just another optimization problem, but with a single variable α.

Idea #2 (backtracking line search): try a few α’s on each iteration until we get one that causes a suitable decrease in the function.

SLIDE 28

Backtracking line search

function α = Backtracking-Line-Search(x, f, ∆x, α₀, γ, β)
    // x: current iterate
    // f: function being optimized
    // ∆x: descent direction (i.e., ∆x = −∇ₓf(x))
    // α₀: initial step size
    // γ ∈ (0, 0.5), β ∈ (0, 1): backtracking parameters
    α ← α₀
    while f(x + α∆x) > f(x) + γ α ∇ₓf(x)ᵀ∆x:
        α ← α · β
    return α

Common choices are γ = 0.001, β = 0.5.

Intuition: for small α, by the first order Taylor approximation f(x + α∆x) ≈ f(x) + α ∇ₓf(x)ᵀ∆x; backtracking line search makes α smaller until this inequality, scaled by γ, holds.
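
A direct Python translation of this pseudocode (a sketch; the helper name and the example objective are my own, and the parameters mirror the slide):

import numpy as np

def backtracking_line_search(x, f, grad, dx, alpha0=1.0, gamma=1e-3, beta=0.5):
    # Shrink alpha until the sufficient-decrease condition from the slide holds
    alpha = alpha0
    while f(x + alpha * dx) > f(x) + gamma * alpha * grad @ dx:
        alpha *= beta
    return alpha

# Example use inside gradient descent on f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2
f = lambda x: 2 * x[0]**2 + x[1]**2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]
grad_f = lambda x: np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])

x = np.zeros(2)
for _ in range(100):
    g = grad_f(x)
    alpha = backtracking_line_search(x, f, g, -g)   # descent direction dx = -grad
    x = x - alpha * g
print(x)   # approaches [1, 2]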

SLIDE 29

[Figure: iterates on the contour plot and f − f⋆ vs. iteration]

100 iterations of gradient descent with backtracking line search on the function

    f(x) = 2x₁² + x₂² + x₁x₂ − 6x₁ − 5x₂

SLIDE 30

Trouble with gradient descent

[Figure: zig-zagging iterates of gradient descent on a poorly scaled quadratic]

Gradient descent with backtracking line search on the function

    f(x) = 0.01x₁² + x₂² − 0.5x₁ − x₂

The gradient is given by (0.02x₁ − 0.5, 2x₂ − 1), which is very poorly scaled (the x₂ component is much bigger than the x₁ component). This motivates approaches that “automatically” find the right scaling.

SLIDE 31

Newton’s root finding method

[Figure: a Newton root-finding step using the tangent line f(x₀) + f′(x₀)(x − x₀) at x₀]

Recall Newton’s method for finding a zero (root) of a univariate function f(x):

    Repeat:  x ← x − f(x) / f′(x)

To optimize some univariate function g, we could use Newton’s method to find a zero of its derivative, via the updates

    Repeat:  x ← x − g′(x) / g″(x)
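
A small sketch of this one-dimensional update (my own example, not from the slides): minimizing g(x) = exp(x) − 2x, whose minimizer is x = ln 2 ≈ 0.6931.

import math

def g_prime(x):  return math.exp(x) - 2.0   # g'(x)
def g_double(x): return math.exp(x)         # g''(x)

x = 0.0
for _ in range(10):
    x = x - g_prime(x) / g_double(x)        # Repeat: x <- x - g'(x) / g''(x)
print(x, math.log(2))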

SLIDE 32

Hessian Matrix

To apply Newton’s method to optimize a multivariate function, we need a generalization of the second derivative, called the Hessian. For a function f : Rⁿ → R, the Hessian is an n × n matrix of all second derivatives

    ∇²ₓf(x) ∈ R^{n×n} = [ ∂²f(x)/∂x₁²     ∂²f(x)/∂x₁∂x₂   ···   ∂²f(x)/∂x₁∂xₙ ]
                        [ ∂²f(x)/∂x₂∂x₁   ∂²f(x)/∂x₂²     ···   ∂²f(x)/∂x₂∂xₙ ]
                        [      ⋮               ⋮           ⋱         ⋮        ]
                        [ ∂²f(x)/∂xₙ∂x₁   ∂²f(x)/∂xₙ∂x₂   ···   ∂²f(x)/∂xₙ²   ]

SLIDE 33

[Figure: f(x) and its second order Taylor approximation f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀) + ½(x − x₀)ᵀ∇²ₓf(x₀)(x − x₀) at x₀]

The Hessian gives the second order Taylor approximation of f at a point x₀

    f(x) ≈ f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀) + ½ (x − x₀)ᵀ ∇²ₓf(x₀) (x − x₀)

Because ∂²f(x)/∂xᵢ∂xⱼ = ∂²f(x)/∂xⱼ∂xᵢ (i.e., it doesn’t matter in which order we differentiate), the Hessian of any function is always a symmetric matrix. For a convex function f, the Hessian is positive semidefinite (all its eigenvalues are greater than or equal to zero).

SLIDE 34

Newton’s Method

(Multivariate) Newton’s method optimizes a function f : Rⁿ → R by the iterations

    Repeat:  x ← x − α (∇²ₓf(x))⁻¹ ∇ₓf(x)

where α is a step size. Can choose α via backtracking line search, with ∆x = −(∇²ₓf(x))⁻¹ ∇ₓf(x) and with α₀ = 1 (we always want to be able to take a “full” Newton step if possible)
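
A compact numpy sketch of these iterations (my own illustration), applied with full steps α = 1 to the example from a later slide, f(x₁, x₂) = exp(x₁ + 3x₂ − 0.1) + exp(x₁ − 3x₂ − 0.1) + exp(−x₁ − 0.1); the hand-derived gradient and Hessian below are my own working.

import numpy as np

def terms(x):
    # The three exponential terms of f
    return (np.exp(x[0] + 3 * x[1] - 0.1),
            np.exp(x[0] - 3 * x[1] - 0.1),
            np.exp(-x[0] - 0.1))

def f(x):
    a, b, c = terms(x)
    return a + b + c

def grad(x):
    a, b, c = terms(x)
    return np.array([a + b - c, 3 * a - 3 * b])

def hess(x):
    a, b, c = terms(x)
    return np.array([[a + b + c, 3 * a - 3 * b],
                     [3 * a - 3 * b, 9 * a + 9 * b]])

x = np.array([1.0, 1.0])
for _ in range(20):
    dx = np.linalg.solve(hess(x), grad(x))   # Newton direction
    x = x - dx                               # full Newton step (alpha = 1)
print(x, f(x))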

SLIDE 35

[Figure: Newton’s method on the poorly scaled quadratic from before, reaching the optimum in one step]

For our previous example, Newton’s method finds the exact solution in a single step (this holds true for any convex quadratic function). For optimizing general convex functions, Newton’s method is usually much faster than gradient descent, and it is the preferred method provided it is feasible to compute and invert the Hessian.

Newton’s method is invariant to linear reparameterizations: if g(x) = f(Ax), then optimizing g with Newton’s method gives the exact same iterates as optimizing f.

SLIDE 36

Example: Newton’s method

f(x₁, x₂) = exp(x₁ + 3x₂ − 0.1) + exp(x₁ − 3x₂ − 0.1) + exp(−x₁ − 0.1)

Convergence of Newton’s method vs. gradient descent

[Figure: f − f⋆ vs. iteration for Newton’s method and gradient descent]

SLIDE 37

Handling nonconvexity

[Figure: a convex function f₁(x) and a nonconvex function f₂(x)]

For nonconvex f, there is a choice: attempt to find a global optimum (hard, requiring grid search in the worst case) or a local optimum (relatively easy: apply the methods above, ignoring nonconvexity). The issue with general nonconvex continuous optimization is that (unlike integer programming) there is relatively little structure to exploit, so we typically need to fall back on exponential-time grid search.

SLIDE 38

In practice, for the vast majority of practical nonconvex problems (e.g., deep networks, probabilistic graphical models with missing variables), people are satisfied with finding local optima. One subtle issue: because the Hessian need not be positive definite, it is often poorly conditioned or non-invertible, which leads to very poor Newton steps. In practice, gradient descent (or more generally, methods based upon just gradient evaluations) is more common for nonconvex functions.

SLIDE 39

Handling nonsmooth functions

Nonsmooth functions may not have a gradient or Hessian. Example: f(x) = |x| does not have a derivative at x = 0. [Figure: f(x) = |x|]

This is not a minor issue just because the function is “only” non-differentiable at one point; the problem is that solutions often lie precisely at these non-differentiable points (cf. linear programming).

SLIDE 40

Subgradients

Although nonsmooth convex functions do not have gradients everywhere, they do have a subgradient at every point. A subgradient is something “like” a gradient, in that for a convex function the affine approximation it defines must lie below the function everywhere:

    f(x) ≥ f(x₀) + gᵀ(x − x₀)   for a subgradient g at x₀

[Figure: a nonsmooth convex function with a supporting line at x₀]

Example: for f(x) = |x|, the subgradients are given by

    g = −1 for x < 0,   g = 1 for x > 0,   g ∈ [−1, 1] for x = 0

SLIDE 41

We can run gradient descent (now called subgradient descent) just as before. One subtle point: with a constant step size (no matter how small), it may never converge exactly (there are similar issues for backtracking line search). [Figure: subgradient descent on f(x) = |x| oscillating around x = 0]

Instead, it is common to use a decreasing sequence of step sizes, e.g.

    αₜ = α₀ / (n₀ + t)
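
A minimal numpy sketch of subgradient descent with this decreasing step size (my own illustration, not from the slides), on the nonsmooth function f(x) = ‖x − c‖₁ whose minimizer is x = c; the constants α₀ and n₀ are arbitrary choices.

import numpy as np

c = np.array([1.0, -2.0, 0.5])            # minimizer of f(x) = ||x - c||_1
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)        # a valid subgradient of f at x

x = np.zeros(3)
alpha0, n0 = 1.0, 10.0                    # illustrative constants
for t in range(2000):
    alpha = alpha0 / (n0 + t)             # decreasing step size alpha_t
    x = x - alpha * subgrad(x)
print(x, f(x))                            # x approaches c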

SLIDE 42

There is no “easy” analogue of Newton’s method when we only have subgradients. But there has been a great deal of recent work on how to solve nonsmooth optimization problems, especially in cases where the function has a smooth and a nonsmooth portion. Proximal algorithms have received a great deal of attention in this setting in recent years (e.g., Parikh et al., 2014).

SLIDE 43

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 44

Constrained optimization

Recall the constrained optimization problem

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 hᵢ(x) = 0,  i = 1, …, p

and its convex variant

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 Ax = b

with f and all gᵢ convex. This is seemingly much more challenging: we need to both find a feasible solution and find an optimal solution amongst these feasible solutions.

SLIDE 45

Projected gradient methods

Suppose we can easily solve the projection subproblem

    Proj(v) = argmin_x   ‖x − v‖₂
              subject to gᵢ(x) ≤ 0,  i = 1, …, m
                         hᵢ(x) = 0,  i = 1, …, p

Then we can use a simple adjustment to gradient descent, called projected gradient descent:

    Repeat:  x ← Proj(x − α ∇ₓf(x))

But doesn’t solving the projection seem just as hard as solving the problem to begin with?

SLIDE 46

The good news: some constraints have very simple forms of projection.

Example: x ≥ 0:

    Proj(v) = argmin_{x ≥ 0} ‖x − v‖₂ = max{v, 0}   (applied elementwise)

Example: Ax = b:

    Proj(v) = argmin_{x : Ax = b} ‖x − v‖₂ = v − Aᵀ(AAᵀ)⁻¹(Av − b)

Important: just because we have a closed form projection for each of two sets C₁, C₂ does not mean we have a closed form projection onto their intersection C₁ ∩ C₂.
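
A small numpy sketch of projected gradient descent (my own illustration) using the elementwise nonnegativity projection above, applied to f(x) = ½‖x − c‖₂² where c has a negative entry, so the constraint x ≥ 0 is active at the solution; the step size is an arbitrary choice.

import numpy as np

c = np.array([2.0, -1.0, 0.5])
grad_f = lambda x: x - c                 # gradient of f(x) = 0.5 * ||x - c||_2^2
proj = lambda v: np.maximum(v, 0.0)      # projection onto {x : x >= 0}

x = np.zeros(3)
alpha = 0.5                              # illustrative step size
for _ in range(100):
    x = proj(x - alpha * grad_f(x))      # Repeat: x <- Proj(x - alpha * grad f(x))
print(x)                                 # -> [2.  0.  0.5]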

SLIDE 47

Duality and the KKT conditions

For a general constrained problem, we consider the Lagrangian

    L(x, λ, ν) = f(x) + Σ_{i=1}^m λᵢ gᵢ(x) + Σ_{i=1}^p νᵢ hᵢ(x)

Just like in the linear programming case, we have

    max_{λ ≥ 0, ν} L(x, λ, ν) = f(x) if x is feasible, and ∞ otherwise

Thus, we can rewrite our optimization problem as

    minimize_x   maximize_{λ ≥ 0, ν}   L(x, λ, ν)

SLIDE 48

Because an optimal (x⋆, λ⋆, ν⋆) must have f(x⋆) = L(x⋆, λ⋆, ν⋆) and ∇ₓL(x⋆, λ⋆, ν⋆) = 0, we have the following conditions:

    1. gᵢ(x⋆) ≤ 0,  ∀ i = 1, …, m
    2. hᵢ(x⋆) = 0,  ∀ i = 1, …, p
    3. λᵢ⋆ ≥ 0,  ∀ i = 1, …, m
    4. Σ_{i=1}^m λᵢ⋆ gᵢ(x⋆) = 0  ⇒  λᵢ⋆ gᵢ(x⋆) = 0,  ∀ i = 1, …, m
    5. ∇ₓf(x⋆) + Σ_{i=1}^m λᵢ⋆ ∇ₓgᵢ(x⋆) + Σ_{i=1}^p νᵢ⋆ ∇ₓhᵢ(x⋆) = 0

These are the Karush-Kuhn-Tucker (KKT) conditions.

SLIDE 49

Equality constrained optimization

For convex f, consider the optimization problem

    minimize_x   f(x)
    subject to   Ax = b

The KKT optimality conditions for this problem are

    ∇ₓf(x⋆) + Aᵀν⋆ = 0
    Ax⋆ − b = 0

SLIDE 50

Just like in the unconstrained case, after a bit of derivation, we can use Newton’s method to find a zero of this system by repeatedly solving the linear system

    [ ∇²ₓf(x)   Aᵀ ] [ ∆x ]   [ ∇ₓf(x) + Aᵀν ]
    [    A      0  ] [ ∆ν ] = [    Ax − b     ]

and then updating

    x ← x − α∆x,   ν ← ν − α∆ν

One subtle point: because we are minimizing and maximizing the Lagrangian over x and ν respectively, we need a slight modification of backtracking line search, updating α ← βα until the norm of the KKT residual decreases:

    ‖ r(x − α∆x, ν − α∆ν) ‖₂ ≤ (1 − γα) ‖ r(x, ν) ‖₂,   where r(x, ν) = [ ∇ₓf(x) + Aᵀν ;  Ax − b ]
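
A numpy sketch of one such Newton step (my own illustration) for an equality-constrained quadratic f(x) = ½xᵀQx − cᵀx, where a single full step (α = 1) already solves the KKT system exactly; Q, c, A, and b are arbitrary test data.

import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2
M = rng.normal(size=(n, n)); Q = M @ M.T + np.eye(n)   # positive definite Hessian
c = rng.normal(size=n)
A = rng.normal(size=(p, n)); b = rng.normal(size=p)

grad = lambda x: Q @ x - c     # gradient of f(x) = 0.5 x^T Q x - c^T x
hess = lambda x: Q

x, nu = np.zeros(n), np.zeros(p)
K = np.block([[hess(x), A.T], [A, np.zeros((p, p))]])       # KKT matrix
r = np.concatenate([grad(x) + A.T @ nu, A @ x - b])         # KKT residual
d = np.linalg.solve(K, r)
x, nu = x - d[:n], nu - d[n:]                               # full step

print(np.allclose(A @ x, b), np.linalg.norm(grad(x) + A.T @ nu))   # feasible and stationary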

SLIDE 51

Inequality constrained optimization

Consider the full convex constrained optimization problem, with convex f, gᵢ

    minimize_x   f(x)
    subject to   gᵢ(x) ≤ 0,  i = 1, …, m
                 Ax = b

The KKT optimality conditions for this problem are

    r_x ≡ ∇ₓf(x⋆) + Σ_{i=1}^m λᵢ⋆ ∇ₓgᵢ(x⋆) + Aᵀν⋆ = 0
    (r_λ)ᵢ ≡ λᵢ⋆ gᵢ(x⋆) = 0,  i = 1, …, m
    r_ν ≡ Ax⋆ − b = 0

plus the condition that gᵢ(x⋆) ≤ 0

SLIDE 52

To apply Newton’s method to find a solution to this system, the constraint that λᵢgᵢ(x) = 0 is difficult; instead, use the constraint that λᵢgᵢ(x) = t and bring t → 0 as the algorithm progresses. The Newton update becomes

    [ ∇²ₓf(x) + Σ_{i=1}^m λᵢ∇²ₓgᵢ(x)    Dg(x)ᵀ       Aᵀ ] [ ∆x ]   [ r_x       ]
    [ diag(λ) Dg(x)                     diag(g(x))    0 ] [ ∆λ ] = [ r_λ + t·1 ]
    [ A                                 0             0 ] [ ∆ν ]   [ r_ν       ]

where

    Dg(x) = [ ∇ₓg₁(x)ᵀ ; … ; ∇ₓg_m(x)ᵀ ]

This is the primal-dual interior point method, which is the state of the art for exactly solving convex optimization problems.

SLIDE 53

Handling nonconvexity

Solving, or even finding a feasible solution for, general nonconvex constrained optimization problems is very challenging. Consider hᵢ(x) = xᵢ(1 − xᵢ) = 0, which is equivalent to xᵢ ∈ {0, 1}, and we know it is NP-hard to find a feasible solution to an integer program. Nonetheless, there are a few methods that can work well in practice to (locally) find hopefully feasible solutions.

SLIDE 54

One popular approach is sequential convex programming: iteratively attempt to solve the penalized, unconstrained objective

    minimize_x   f(x) + µ Σ_{i=1}^m |gᵢ(x)|₊ + µ Σ_{i=1}^p |hᵢ(x)|

for µ > 0 and |x|₊ = max{0, x}. Approximate this with the convex problem

    minimize_{x : ‖x − x₀‖ ≤ ϵ}   f̃(x, x₀) + µ Σ_{i=1}^m |g̃ᵢ(x, x₀)|₊ + µ Σ_{i=1}^p |h̃ᵢ(x, x₀)|

where f̃(x, x₀) = f(x₀) + ∇ₓf(x₀)ᵀ(x − x₀) (and similarly for g̃ᵢ, h̃ᵢ) are first order Taylor approximations. For large enough µ, if there is a feasible solution “near” x₀, we will often find a point that satisfies the constraints exactly.

SLIDE 55

Handling nonsmoothness

A wonderful property of many nonsmooth functions is that they can be represented by smooth constrained problems (so in some sense, nonsmoothness is easier to handle for constrained problems than for unconstrained ones). Consider the problem

    minimize_x   ‖Ax − b‖₂² + µ Σ_{i=1}^n |xᵢ|

This can be written as

    minimize_{x, y}   ‖Ax − b‖₂² + µ Σ_{i=1}^n yᵢ
    subject to        −yᵢ ≤ xᵢ ≤ yᵢ,  ∀ i = 1, …, n
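
To illustrate (my own sketch, not from the slides), the two formulations below give the same optimal value; this uses the current cvxpy interface, which differs slightly from the 2015-era calls on the final slide, and A, b, µ are arbitrary test data.

import numpy as np
import cvxpy as cp

np.random.seed(0)
A, b, mu = np.random.randn(20, 10), np.random.randn(20), 1.0

# Nonsmooth form: ||Ax - b||_2^2 + mu * sum_i |x_i|
x = cp.Variable(10)
p1 = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + mu * cp.norm(x, 1)))
p1.solve()

# Smooth constrained form with auxiliary variables y
x2, y = cp.Variable(10), cp.Variable(10)
p2 = cp.Problem(cp.Minimize(cp.sum_squares(A @ x2 - b) + mu * cp.sum(y)),
                [-y <= x2, x2 <= y])
p2.solve()

print(p1.value, p2.value)   # same optimal value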

SLIDE 56

Outline

Introduction to optimization
Types of optimization problems
Unconstrained optimization
Constrained optimization
Practical optimization

SLIDE 57

Practically solving optimization problems

The good news: for many classes of optimization problems, people have already done all the “hard work” of developing numerical algorithms. There is a wide range of tools that can take optimization problems in “natural” forms and compute a solution. Some well-known libraries: CVX (MATLAB), YALMIP (MATLAB), AMPL (custom language), GAMS (custom language), cvxpy (Python).

SLIDE 58

cvxpy

A Python library for specifying and solving convex optimization problems. Available at http://www.cvxpy.org. Under active development, but at a relatively stable point.

SLIDE 59

Image deblurring

    minimize_x   ‖K∗x − y‖₂² + µ ( Σ_{i=1}^{n−1} |x_{mi} − x_{m(i+1)}| + Σ_{i=1}^{m−1} |x_{ni} − x_{n(i+1)}| )

cvxpy code:

import cvxpy as cp

# Note: this is the code from the slide, which uses the pre-1.0 cvxpy interface
# (Variable(rows, cols), sum_entries); newer versions rename these calls.
Y, K = ...   # corrupted image and convolution matrix (elided on the slide); mu is the regularization weight

X = cp.Variable(Y.shape[0], Y.shape[1])
f = (cp.sum_squares(K * cp.vec(X) - cp.vec(Y))
     + mu * cp.sum_entries(cp.abs(X[:, :-1] - X[:, 1:]))    # horizontal neighbor differences
     + mu * cp.sum_entries(cp.abs(X[:-1, :] - X[1:, :])))   # vertical neighbor differences
constraints = []
result = cp.Problem(cp.Minimize(f), constraints).solve()