15-780 Graduate Artificial Intelligence: Optimization



SLIDE 1

15-780 – Graduate Artificial Intelligence: Optimization

  • J. Zico Kolter (this lecture) and Ariel Procaccia

Carnegie Mellon University, Spring 2017

1

SLIDE 2

Outline

  • Introduction to optimization
  • Types of optimization problems, convexity
  • Solving optimization problems

2

SLIDE 3

Logistics

  • HW0: there was some unintentional ambiguity about the “no late days” criteria
  • To be clear, in all future assignments, the policy is: you have 5 late days, no more than 2 on any assignment
  • If you use up your five late days, you will receive 20% off per day for those two days
  • If you submit any homework more than 2 days late, you will receive zero credit
  • All homework, both programming and written portions, must be written up independently
  • All students who submitted HW0 have been taken off the waitlist

3

SLIDE 4

Outline

  • Introduction to optimization
  • Types of optimization problems, convexity
  • Solving optimization problems

4

SLIDE 5

Continuous optimization

  • The problems we have seen so far in class (i.e., search) involve making decisions over a discrete space of choices
  • One of the most significant trends in AI in the past 15 years has been the integration of optimization methods throughout the field

5

                        Discrete search   (Convex) optimization
  Variables             Discrete          Continuous
  # Solutions           Finite            Infinite
  Solution complexity   Exponential       Polynomial

SLIDE 6

Optimization definitions

  • We’ll write optimization problems like this:

      minimize_x  f(x)
      subject to  x ∈ 𝒞

    which should be interpreted to mean: we want to find the value of x that achieves the smallest possible value of f(x), out of all points in 𝒞
  • Important terms:
      x ∈ ℝⁿ – optimization variable (vector with n real-valued entries)
      f : ℝⁿ → ℝ – optimization objective
      𝒞 ⊆ ℝⁿ – constraint set
      x⋆ ≡ argmin_{x ∈ 𝒞} f(x) – optimal solution
      f⋆ ≡ f(x⋆) ≡ min_{x ∈ 𝒞} f(x) – optimal objective

6
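To make these terms concrete, here is a tiny numerical sketch (not from the lecture; it assumes SciPy is available and uses a made-up one-dimensional objective) that recovers the optimal solution x⋆ and optimal objective f⋆:

```python
from scipy.optimize import minimize_scalar

# Minimize f(x) = (x - 3)^2 over all of R.
# By inspection, the optimal solution is x* = 3 and the optimal objective is f* = 0.
res = minimize_scalar(lambda x: (x - 3.0) ** 2)
print(res.x, res.fun)   # ≈ 3.0, 0.0
```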

SLIDE 7

Example: Weber point

  • Given a collection of cities (assume on a 2D plane), how can we find the location that minimizes the sum of distances to all cities?
  • Denote the locations of the cities as z^(1), …, z^(m)
  • Write as the optimization problem:

      minimize_x  ∑_{i=1}^m ‖x − z^(i)‖₂

7

SLIDE 8

Example: image deblurring

  • Given a corrupted image Z ∈ ℝᵐˣⁿ, reconstruct the image by solving the optimization problem:

      minimize_Y  ∑_{i,j} (Z_{ij} − (L ∗ Y)_{ij})² + μ ∑_{i,j} ((Y_{ij} − Y_{i,j+1})² + (Y_{i+1,j} − Y_{ij})²)^{1/2}

    where L ∗ denotes convolution with a blurring filter

8

Figure from (O’Connor and Vandenberghe, 2014): (a) Original image. (b) Blurry, noisy image. (c) Restored image.

SLIDE 9

Example: robot trajectory planning

  • Many robotic planning tasks are more complex than shortest path, e.g., they involve robot dynamics or require “smooth” controls
  • Common to formulate the planning problem as an optimization task
  • Robot state x_t and control inputs u_t:

      minimize_{x_{1:T}, u_{1:T−1}}  ∑_{t=1}^{T−1} ‖u_t‖₂²
      subject to  x_{t+1} = f_dynamics(x_t, u_t)
                  x_t ∈ FreeSpace, ∀t
                  x_1 = x_init,  x_T = x_goal

9

Figure from (Schulman et al., 2014)

SLIDE 10

Example: machine learning

  • As we will see in much more detail shortly, virtually all (supervised) machine learning algorithms boil down to solving an optimization problem:

      minimize_θ  ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i))

    where x^(i) ∈ 𝒳 are inputs, y^(i) ∈ 𝒴 are outputs, ℓ is a loss function, and h_θ is a hypothesis function parameterized by θ, the parameters of the model we are optimizing over
  • Much more on this soon

10
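For instance (a hypothetical setup, not part of the slides): with a linear hypothesis h_θ(x) = θᵀx and squared loss ℓ(h, y) = (h − y)², this minimization is ordinary least squares, which NumPy can solve directly:

```python
import numpy as np

# Hypothetical data: m examples with n features, targets from a noisy linear model.
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(m)

# minimize_theta  sum_i (theta^T x_i - y_i)^2   (squared loss, linear hypothesis)
theta = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)   # ≈ theta_true
```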

SLIDE 11

The benefit of optimization

  • One of the key benefits of looking at problems in AI as optimization problems: we separate the definition of the problem from the method for solving it
  • For many classes of problems, there are off-the-shelf solvers that will let you solve even large, complex problems, once you have put them in the right form

11

SLIDE 12

Outline

  • Introduction to optimization
  • Types of optimization problems, convexity
  • Solving optimization problems

12

SLIDE 13

Classes of optimization problems

  • There are many different names for types of optimization problems: linear programming, quadratic programming, nonlinear programming, semidefinite programming, integer programming, geometric programming, mixed linear binary integer programming (the list goes on and on, and it can all get a bit confusing)
  • We’re instead going to focus on two dimensions: convex vs. nonconvex and constrained vs. unconstrained

13

               Constrained           Unconstrained
  Convex       Linear programming    Most machine learning
  Nonconvex    Integer programming   Deep learning

SLIDE 14

Constrained vs. unconstrained

  • In unconstrained optimization, every point x ∈ ℝⁿ is feasible, so the singular focus is on minimizing f(x)
  • In contrast, for constrained optimization, it may be difficult to even find a point x ∈ 𝒞
  • Often leads to very different methods for optimization (more next lecture)

      Constrained:                Unconstrained:
      minimize_x  f(x)            minimize_x  f(x)
      subject to  x ∈ 𝒞

14

SLIDE 15

Convex vs. nonconvex optimization

  • Originally, researchers distinguished between linear (easy) and nonlinear (hard) problems
  • But in the 80s and 90s, it became clear that this wasn’t the right distinction; the key difference is between convex and nonconvex problems
  • Convex problem:

      minimize_x  f(x)
      subject to  x ∈ 𝒞

    where f is a convex function and 𝒞 is a convex set

15

(Figure: a convex function f₁(x) and a nonconvex function f₂(x))

SLIDE 16

Convex sets

  • A set 𝒞 is convex if, for any x, y ∈ 𝒞 and 0 ≤ θ ≤ 1,

      θx + (1 − θ)y ∈ 𝒞

  • Examples:
      All points: 𝒞 = ℝⁿ
      Intervals: 𝒞 = {x ∈ ℝⁿ | l ≤ x ≤ u} (elementwise inequality)
      Linear equalities: 𝒞 = {x ∈ ℝⁿ | Ax = b} (for A ∈ ℝᵐˣⁿ, b ∈ ℝᵐ)
      Intersection of convex sets: 𝒞 = ⋂_{i=1}^m 𝒞_i

16

(Figure: a convex set and a nonconvex set)

SLIDE 17

Convex functions

  • A function f : ℝⁿ → ℝ is convex if, for any x, y ∈ ℝⁿ and 0 ≤ θ ≤ 1,

      f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

  • Convex functions “curve upwards” (or at least not downwards)
  • If f is convex, then −f is concave
  • If f is both convex and concave, it is affine, and must be of the form:

      f(x) = ∑_{i=1}^n a_i x_i + b

17

(Figure: the chord between (x, f(x)) and (y, f(y)) lies above the function)

SLIDE 18

Examples of convex functions

  • Exponential: f(x) = exp(ax), for any a ∈ ℝ
  • Negative logarithm: f(x) = −log x, with domain x > 0
  • Squared Euclidean norm: f(x) = ‖x‖₂² ≡ xᵀx ≡ ∑_{i=1}^n x_i²
  • Euclidean norm: f(x) = ‖x‖₂
  • Non-negative weighted sum of convex functions: f(x) = ∑_{i=1}^m w_i f_i(x), with w_i ≥ 0 and f_i convex

18
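These claims can be spot-checked numerically by sampling random points and testing the defining inequality (a quick sanity check, not a proof; the function choices here are illustrative):

```python
import numpy as np

def check_convexity(f, n, trials=1000, seed=0, tol=1e-9):
    """Empirically test f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) at random points."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False   # found a counterexample to convexity
    return True

print(check_convexity(lambda x: np.exp(2 * x[0]), 1))    # exponential: convex
print(check_convexity(lambda x: np.dot(x, x), 5))        # squared norm: convex
print(check_convexity(lambda x: np.linalg.norm(x), 5))   # Euclidean norm: convex
print(check_convexity(lambda x: x[0] * x[1], 2))         # product x1*x2: not convex
```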

SLIDE 19

Poll: convex sets and functions

Which of the following functions or sets are convex?
  1. A union of two convex sets 𝒞 = 𝒞₁ ∪ 𝒞₂
  2. The set {x ∈ ℝ² | x ≥ 0, x₁x₂ ≥ 1}
  3. The function f : ℝ² → ℝ, f(x) = x₁x₂
  4. The function f : ℝ² → ℝ, f(x) = x₁² + x₂² + x₁x₂

19

SLIDE 20

Convex optimization

  • The key aspect of convex optimization problems that makes them tractable is that all local optima are global optima
  • Definition: a point x is globally optimal if x is feasible and there is no feasible y such that f(y) < f(x)
  • Definition: a point x is locally optimal if x is feasible and there is some R > 0 such that for all feasible y with ‖x − y‖₂ ≤ R, f(x) ≤ f(y)
  • Theorem: for a convex optimization problem, all locally optimal points are globally optimal

20

SLIDE 21

Proof of global optimality

  • Proof: Suppose x is locally optimal (with optimality radius R), and suppose there exists some feasible y such that f(y) < f(x)
  • Now consider the point

      z = θx + (1 − θ)y,  with θ = 1 − R / (2‖x − y‖₂)

    1) Since x, y ∈ 𝒞 (the feasible set), we also have z ∈ 𝒞 (by convexity of 𝒞)
    2) Furthermore, since f is convex:

      f(z) = f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) < f(x)

    and

      ‖x − z‖₂ = ‖x − (1 − R/(2‖x − y‖₂))x − (R/(2‖x − y‖₂))y‖₂
               = (R/(2‖x − y‖₂)) ‖x − y‖₂ = R/2

  • Thus z is feasible, within radius R of x, and has a lower objective value, contradicting the supposed local optimality of x

21

SLIDE 22

Outline

  • Introduction to optimization
  • Types of optimization problems, convexity
  • Solving optimization problems

22

SLIDE 23

The gradient

  • A key concept in solving optimization problems is the notion of the gradient of a function (the multivariate analogue of the derivative)
  • For f : ℝⁿ → ℝ, the gradient is defined as the vector of partial derivatives:

      ∇_x f(x) ∈ ℝⁿ = [ ∂f(x)/∂x₁, ∂f(x)/∂x₂, …, ∂f(x)/∂x_n ]ᵀ

  • Points in the “steepest direction” of increase of the function f

23
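In practice, gradient computations are often checked with central finite differences; a minimal sketch, using f(x) = ‖x‖₂², whose gradient is 2x:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative of f at x by a central difference."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.dot(x, x)          # f(x) = ||x||_2^2, so the gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))     # ≈ [2., -4., 6.]
```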

SLIDE 24

Gradient descent

  • The gradient motivates a simple algorithm for minimizing f(x): take small steps in the direction of the negative gradient
  • “Convergence” can be defined in a number of ways

24

  Algorithm: Gradient Descent
    Given: function f, initial point x₀, step size α > 0
    Initialize: x ← x₀
    Repeat until convergence:
      x ← x − α ∇_x f(x)
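The algorithm above is only a few lines of code; a sketch that uses a small gradient norm as its convergence test (one of several reasonable definitions), on a made-up quadratic objective:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    """Minimize f by repeatedly stepping against its gradient grad_f.

    Stops when the gradient norm falls below `tol`.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# Minimize f(x) = ||x - c||_2^2, whose unique minimum is at c (gradient 2(x - c)).
c = np.array([3.0, -1.0])
x_star = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2))
print(x_star)   # ≈ [ 3., -1.]
```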

SLIDE 25

Gradient descent works

  • Theorem: For differentiable f and small enough α, at any point x that is not a (local) minimum,

      f(x − α∇_x f(x)) < f(x)

    i.e., the gradient descent algorithm will decrease the objective
  • Proof: Any differentiable function f can be written in terms of its Taylor expansion:

      f(x + v) = f(x) + ∇_x f(x)ᵀv + O(‖v‖₂²)

25

(Figure: f(x + v) versus the linear approximation f(x) + ∇_x f(x)ᵀv)

SLIDE 26

Gradient descent works (cont)

  • Choosing v = −α∇_x f(x), we have:

      f(x − α∇_x f(x)) = f(x) − α∇_x f(x)ᵀ∇_x f(x) + O(‖α∇_x f(x)‖₂²)
                       ≤ f(x) − α‖∇_x f(x)‖₂² + Cα²‖∇_x f(x)‖₂²
                       = f(x) − (α − Cα²)‖∇_x f(x)‖₂²
                       < f(x)    (for α < 1/C and ‖∇_x f(x)‖₂² > 0)  ∎

    (Watch out: there is a bit of subtlety in the inequality step; it only holds for small α∇_x f(x))
  • We are guaranteed to have ‖∇_x f(x)‖₂² > 0 except at optima
  • Works for both convex and nonconvex functions, but for convex functions it is guaranteed to find the global optimum

26

SLIDE 27

Poll: modified gradient descent

Consider an alternative version of gradient descent where, instead of the update x − α∇_x f(x), we choose some other direction and update x + αv, where v has a negative inner product with the gradient:

    ∇_x f(x)ᵀv < 0

Will this update, for suitably chosen α, still decrease the objective?
  1. No, not necessarily (for either convex or nonconvex functions)
  2. Only for convex functions
  3. Only for nonconvex functions
  4. Yes, for both convex and nonconvex functions

27

SLIDE 28

Gradient descent in practice

  • The choice of α matters a lot in practice:

      minimize_x  2x₁² + x₂² + x₁x₂ − 6x₁ − 5x₂

28

(Figures: gradient descent iterates for α = 0.05, α = 0.2, α = 0.42)
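A sketch of the experiment behind this comparison (setting the gradient of the objective above to zero gives the minimum x⋆ = (1, 2)):

```python
import numpy as np

def f(x):
    return 2 * x[0]**2 + x[1]**2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]

def grad_f(x):
    # Partial derivatives of the quadratic above.
    return np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])

results = {}
for alpha in (0.05, 0.2, 0.42):
    x = np.zeros(2)
    for _ in range(100):
        x = x - alpha * grad_f(x)
    results[alpha] = x
    print(f"alpha={alpha}: x={np.round(x, 4)}, f(x)={f(x):.6f}")
```

With these step sizes, α = 0.2 converges quickly, α = 0.05 creeps toward the optimum, and α = 0.42 oscillates before settling; a slightly larger step size would diverge.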

SLIDE 29

Dealing with constraints, non-differentiability

  • For settings where we can easily project points onto the constraint set 𝒞, we can use a simple generalization called projected gradient descent, where P_𝒞 denotes projection onto 𝒞:

      Repeat: x ← P_𝒞(x − α∇_x f(x))

  • If f is not differentiable but continuous, it still has what is called a subgradient, and we can replace the gradient with a subgradient in all cases (but the theory/practice of convergence is quite different)

29
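As a sketch of projected gradient descent (with made-up problem data), take a box constraint 0 ≤ x ≤ 1, for which the projection P_𝒞 is just elementwise clipping:

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, alpha=0.1, iters=500):
    """Step against the gradient, then project back onto the constraint set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - alpha * grad_f(x))
    return x

# Minimize ||x - c||_2^2 subject to the box 0 <= x <= 1.
c = np.array([2.0, -0.5, 0.3])
x_star = projected_gradient_descent(
    grad_f=lambda x: 2 * (x - c),
    project=lambda x: np.clip(x, 0.0, 1.0),
    x0=np.zeros(3),
)
print(x_star)   # ≈ [1. , 0. , 0.3], i.e., c clipped to the box
```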

SLIDE 30

Optimization in practice

  • We won’t discuss this too much yet, but one of the beautiful properties of optimization problems is that there exists a wealth of tools that can solve them using very simple notation
  • Example: solving the Weber point problem using cvxpy (http://cvxpy.org)

30

import numpy as np
import cvxpy as cp

# Random problem data: m = 10 city locations in n = 5 dimensions.
n, m = 5, 10
z = np.random.randn(n, m)

# Weber point: minimize the sum of Euclidean distances to all cities.
x = cp.Variable(n)
f = sum(cp.norm2(x - z[:, i]) for i in range(m))
cp.Problem(cp.Minimize(f), []).solve()