Optimization: Unconstrained and Constrained Optimization
Optimization
- Constrained optimization: Newton with equality constraints, Simplex method, Interior-point method, Active-set method
- Unconstrained optimization
  - One-dimensional: Newton's method
  - Multi-dimensional: Newton's method (basic Newton, Gauss-Newton, quasi-Newton) and descent methods (gradient descent, conjugate gradient)
Unconstrained optimization
- Define an objective function over a domain: f: R^n → R
- Optimization variables: x^T = {x1, x2, ..., xn}

minimize f(x1, x2, ..., xn), i.e., minimize f(x) for x ∈ R^n
Constraints
- Equality constraints: a_i(x) = 0 for x ∈ R^n, where i = 1, ..., p
- Inequality constraints: c_j(x) ≥ 0 for x ∈ R^n, where j = 1, ..., q
Constrained optimization
- Solution: x* satisfies the constraints a_i and c_j while minimizing the objective function f(x)

minimize f(x), for x ∈ R^n
subject to a_i(x) = 0, where i = 1, ..., p
           c_j(x) ≥ 0, where j = 1, ..., q
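As a small illustration of this formulation (not from the slides; the objective, the constraints, and the use of SciPy's general-purpose solver are assumptions made for the example), one equality and one inequality constraint can be encoded as:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                                    # objective f(x1, x2)
        return (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2

    constraints = [
        {"type": "eq",   "fun": lambda x: x[0] + x[1] - 3.0},   # a(x) = 0
        {"type": "ineq", "fun": lambda x: x[1] - 0.5},          # c(x) >= 0
    ]

    result = minimize(f, x0=np.zeros(2), constraints=constraints)
    print(result.x)                              # x* satisfying both constraints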
Formulate an optimization
- A general optimization problem is very difficult to solve
- Certain problem classes can be solved efficiently and reliably
- Convex problems can be solved efficiently and reliably, and the solutions found are global
- Nonconvex problems do not guarantee global solutions
Example: pattern matching
- A pattern can be described by a set of points, P = {p1, p2, ..., pn}
- The same object viewed from a different distance or a different angle corresponds to a different pattern P'
- Two patterns P and P' are similar if, for some scale η, rotation angle θ, and translation (r1, r2),

  p'_i = η [ cos θ  −sin θ ] p_i + [ r1 ]
           [ sin θ   cos θ ]       [ r2 ]
Example: pattern matching
Let Q = {q1, q2, ..., qn} be the target pattern; find the most similar pattern among P1, P2, ..., Pn.
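A minimal numpy sketch (not from the slides; the function name and the least-squares score are illustrative) of how well a transformed P matches the target Q:

    import numpy as np

    def dissimilarity(P, Q, eta, theta, r):
        """Sum of squared distances between the transformed pattern P and the target Q.

        P, Q: (n, 2) point arrays; eta: scale; theta: rotation angle; r: (2,) translation.
        """
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        P_transformed = eta * P @ R.T + r          # p'_i = eta * R p_i + (r1, r2)
        return np.sum((P_transformed - Q) ** 2)    # minimize this over (eta, theta, r)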
Inverse kinematics
Given a set of 3D marker positions, find the pose, described by joint angles, that best matches them.
Optimal motion trajectories
Quiz
- Arrive at d with velocity = 0; maximal force allowed: F
- Minimize time? Minimize energy?
- Unconstrained optimization
- Newton method
- Gauss-Newton method
- Gradient descent method
- Conjugate gradient method
Newton method
Find the roots of a nonlinear function: C(x) = 0

We can linearize the function as
C(x̄) = C(x) + C'(x)(x̄ − x) = 0, where C'(x) = ∂C/∂x

Then we can estimate the root as
x̄ = x − C(x) / C'(x)
Root estimation
(Figure: Newton iterates x(0), x(1), x(2) sliding along the tangents of C(x).)

Each new estimate is the root of the current linearization:
C(x(1)) = C(x(0)) + C'(x(0))(x(1) − x(0)) = 0
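A minimal Python sketch of this root-finding iteration; the function names, tolerance, and example are illustrative, not from the slides:

    def newton_root(C, C_prime, x0, tol=1e-10, max_iter=50):
        """Find a root of C(x) = 0 by repeated linearization: x <- x - C(x)/C'(x)."""
        x = x0
        for _ in range(max_iter):
            step = C(x) / C_prime(x)
            x = x - step
            if abs(step) < tol:
                break
        return x

    # Example: the root of C(x) = x^2 - 2 near x0 = 1 is sqrt(2)
    print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))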
Minimization
Find x* such that the nonlinear function F(x*) is a minimum.
What is the simplest function that has minima? Approximate F locally by a quadratic:

F(x(k) + δ) = F(x(k)) + F'(x(k)) δ + (1/2) F''(x(k)) δ^2

Find the roots of F'(x): setting ∂F(x(k) + δ)/∂δ = 0 gives

δ = − F'(x(k)) / F''(x(k))
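The same idea applied to minimization, as a short sketch (illustrative names; it simply repeats the δ update until the step is small):

    def newton_minimize_1d(F_prime, F_second, x0, tol=1e-10, max_iter=50):
        """Minimize F by driving F'(x) to zero with steps delta = -F'(x)/F''(x)."""
        x = x0
        for _ in range(max_iter):
            delta = -F_prime(x) / F_second(x)
            x = x + delta
            if abs(delta) < tol:
                break
        return x

    # Example: F(x) = (x - 3)^2 has its minimum at x* = 3
    print(newton_minimize_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))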
Conditions
- What are the conditions for minima to exist?
- Necessary conditions for a local minimum at x*:  F'(x*) = 0 and F''(x*) ≥ 0
- Sufficient conditions for an isolated minimum at x*:  F'(x*) = 0 and F''(x*) > 0
Minimization
(Figure: F(x) and F'(x) near the minimum x*, where F'(x*) = 0 and F''(x*) > 0.)
Multidimensional optimization
- Search methods only need function evaluations
- First-order gradient-based methods also depend on the gradient g
- Second-order gradient-based methods depend on both the gradient g and the Hessian H
Multiple variables
F(x(k) + p) = F(x(k)) + g^T(x(k)) p + (1/2) p^T H(x(k)) p

gradient vector:
g(x) = ∇x F = [ ∂F/∂x1, ..., ∂F/∂xn ]^T

Hessian matrix:
H(x) = ∇²xx F =
[ ∂²F/∂x1²      ...  ∂²F/∂x1∂xn ]
[ ∂²F/∂x2∂x1    ...  ∂²F/∂x2∂xn ]
[     ...       ...      ...     ]
[ ∂²F/∂xn∂x1    ...  ∂²F/∂xn²   ]
Multiple variables
Setting the gradient of this quadratic model to zero gives the Newton step:
0 = g(x(k)) + H(x(k)) p
p = −H(x(k))^{-1} g(x(k))
x(k+1) = x(k) + p
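A minimal numpy sketch of one such Newton step (the gradient and Hessian callables are assumptions for illustration); solving H p = −g is preferable to forming H^{-1} explicitly:

    import numpy as np

    def newton_step(x, grad, hess):
        """One Newton update: solve H(x) p = -g(x) and return x + p."""
        p = np.linalg.solve(hess(x), -grad(x))
        return x + p

    # Example: for a quadratic F(x) = x^T A x / 2 - b^T x, one step reaches the minimum
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(newton_step(np.zeros(2), lambda x: A @ x - b, lambda x: A))  # equals solve(A, b)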
Multiple variables
Necessary conditions:  g(x*) = 0 and p^T H* p ≥ 0 for all p (H is positive semi-definite)
Sufficient conditions: g(x*) = 0 and p^T H* p > 0 for all p ≠ 0 (H is positive definite)
Gauss-Newton method
- What if the objective function is in the form of a
vector of functions?
- The real-valued objective function can be formed as

  f = [f1(x) f2(x) ... fm(x)]^T,   F = Σ_{p=1}^{m} f_p(x)^2 = f^T f
Jacobian
- Each fp(x) depends on xi for i = 1, 2, ..., n, so a gradient matrix (the Jacobian J, with entries J_pi = ∂f_p/∂x_i) can be formed
- The Jacobian need not be a square matrix
Gradient and Hessian
- Gradient of the objective function:

  ∂F/∂x_i = Σ_{p=1}^{m} 2 f_p(x) ∂f_p/∂x_i,   so g_F = 2 J^T f

- Hessian of the objective function:

  ∂²F/∂x_i∂x_j = 2 Σ_{p=1}^{m} ∂f_p/∂x_i ∂f_p/∂x_j + 2 Σ_{p=1}^{m} f_p(x) ∂²f_p/∂x_i∂x_j,   so H_F ≈ 2 J^T J
Gauss-Newton algorithm
- In the kth iteration, compute f_p(x_k) and J_k to obtain new g_k and H_k
- Compute p_k = −(2 J^T J)^{-1} (2 J^T f) = −(J^T J)^{-1} (J^T f)
- Find α_k that minimizes F(x_k + α_k p_k)
- Set x_{k+1} = x_k + α_k p_k
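A compact numpy sketch of this Gauss-Newton loop; the residual and Jacobian callables are assumptions, and a fixed step α_k = 1 is used in place of the line search in step 3:

    import numpy as np

    def gauss_newton(residuals, jacobian, x0, max_iter=50, tol=1e-10):
        """Minimize F(x) = sum_p f_p(x)^2 with steps p_k = -(J^T J)^{-1} J^T f."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            f = residuals(x)                        # m-vector of f_p(x)
            J = jacobian(x)                         # m x n Jacobian, J[p, i] = df_p/dx_i
            p = np.linalg.solve(J.T @ J, -J.T @ f)
            x = x + p                               # alpha_k = 1 assumed here
            if np.linalg.norm(p) < tol:
                break
        return x

    # Example: fit y = c0 * exp(c1 * t) to noise-free data
    t = np.linspace(0.0, 1.0, 20)
    y = 2.0 * np.exp(-1.5 * t)
    res = lambda c: c[0] * np.exp(c[1] * t) - y
    jac = lambda c: np.column_stack([np.exp(c[1] * t), c[0] * t * np.exp(c[1] * t)])
    print(gauss_newton(res, jac, x0=[1.5, -1.0]))   # approaches [2.0, -1.5]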
- First-order gradient methods
- Greatest gradient descent
- Conjugate gradient
Solving large linear system
Ax = b
- A: a known, square, symmetric, positive semi-definite matrix
- b: a known vector
- x: an unknown vector

If A is dense, solve with factorization and back substitution.
If A is sparse, solve with iterative methods (descent methods).
Quadratic form
F(x) = (1/2) x^T A x − b^T x + c

The gradient of F(x) is
F'(x) = (1/2) A^T x + (1/2) A x − b

If A is symmetric, F'(x) = Ax − b, so the critical point of F, where F'(x) = 0 = Ax − b, is also the solution to Ax = b.
If A is not symmetric, what is the linear system solved by finding the critical points of F?
Greatest gradient descent
Take a step along the direction in which F descends most quickly:
−F'(x(k)) = b − A x(k)

Start at an arbitrary point x(0) and slide down to the bottom of the paraboloid.
Take a series of steps x(1), x(2), ... until we are satisfied that we are close enough to the solution x*.
Greatest gradient descent
Important definitions:
- error: e(k) = x(k) − x*
- residual: r(k) = b − A x(k) = −F'(x(k)) = −A e(k)

Think of the residual as the direction of greatest descent.
Line search
x(1) = x(0) + α r(0)

But how big a step α should we take? A line search is a procedure that chooses α to minimize F along a line.
Optimal step size
Choose α so that
d/dα F(x(1)) = F'(x(1))^T d/dα x(1) = F'(x(1))^T r(0) = 0

That is, the new gradient F'(x(1)) = −r(1) must be orthogonal to the search direction r(0):
r^T_(0) r(1) = 0
Optimal step size
Exercise: derive α from r^T_(k) r(k+1) = 0.
Hint: replace the terms involving (k+1) with those involving (k) using x(k+1) = x(k) + α r(k).

Ans:  α = r^T_(k) r(k) / (r^T_(k) A r(k))
Recurrence of residual
The algorithm requires two matrix-vector multiplications per iteration:
1. r(k) = b − A x(k)
2. α = r^T_(k) r(k) / (r^T_(k) A r(k))
3. x(k+1) = x(k) + α r(k)

One multiplication can be eliminated by replacing step 1 with
r(k+1) = r(k) − α A r(k)
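A numpy sketch of greatest (steepest) gradient descent for a symmetric positive-definite A, using the residual recurrence so only one matrix-vector product is needed per iteration; the names and stopping rule are illustrative:

    import numpy as np

    def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
        """Solve Ax = b by stepping along the residual with the optimal step size."""
        x = np.asarray(x0, dtype=float)
        r = b - A @ x                        # r(0) = b - A x(0) = -F'(x(0))
        for _ in range(max_iter):
            if np.linalg.norm(r) < tol:
                break
            Ar = A @ r                       # the single matrix-vector product per iteration
            alpha = (r @ r) / (r @ Ar)       # optimal step size along r
            x = x + alpha * r
            r = r - alpha * Ar               # recurrence replaces r = b - A x
        return x

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    print(steepest_descent(A, b, np.zeros(2)))   # matches np.linalg.solve(A, b)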
Quiz
- In our IK problem, we use the greatest gradient descent method to find an optimal pose, but we cannot compute α using the formula described in the previous slides. Why?
Line search
- Exact line search: choose t to minimize f along the ray {x + tΔx | t ≥ 0}:
  t = argmin_{s ≥ 0} f(x + sΔx)
- Backtracking line search depends on two constants, α and β:
  given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), β ∈ (0, 1)
  t := 1
  while f(x + tΔx) > f(x) + α t ∇f(x)^T Δx,  t := βt
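A direct Python sketch of the backtracking rule above; the test function, gradient, and constants are illustrative:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
        """Shrink t until f(x + t dx) <= f(x) + alpha t grad_f(x)^T dx holds."""
        t = 1.0
        slope = grad_f(x) @ dx                    # directional derivative; negative for a descent direction
        while f(x + t * dx) > f(x) + alpha * t * slope:
            t *= beta
        return t

    # Example: a quadratic f with the gradient descent direction dx = -grad_f(x)
    f = lambda x: 0.5 * x @ x
    grad_f = lambda x: x
    x = np.array([4.0, -2.0])
    t = backtracking_line_search(f, grad_f, x, -grad_f(x))
    print(t, x - t * grad_f(x))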
Poor convergence
What is the problem with greatest descent? Wouldn't it be nice if we could avoid traversing the same direction over and over?
Conjugate directions
Pick a set of directions: d(0), d(1), ..., d(n−1)
Take exactly one step along each direction; the solution is found within n steps.
Two problems:
- 1. How do we determine these directions?
- 2. How do we determine the step size along each direction?
A-orthogonality
If we take the optimal step size along each direction,

d/dα F(x(k+1)) = F'(x(k+1))^T d/dα x(k+1) = −r^T_(k+1) d(k) = d^T_(k) A e(k+1) = 0
Two different vectors v and u are A-orthogonal, or conjugate, if v^T A u = 0.
A-orthogonality
(Figure: a pair of A-orthogonal vectors compared with a pair of orthogonal vectors.)
Optimal size step
e(k+1) must be A-orthogonal to d(k).
Using this condition, can you derive α(k)?
Algorithm
Suppose we can come up with a set of A-orthogonal directions {d(k)}; this algorithm will converge in n steps:
1. Compute the step size α(k) = d^T_(k) r(k) / (d^T_(k) A d(k))
2. Take x(k+1) = x(k) + α(k) d(k) along d(k)
3. Move on to the next direction

We need to prove that x* can be found in n steps if we take step size α(k) along d(k) at each step.
Why does it work?
Expand the initial error in the basis of search directions (the d's are linearly independent if the d's are A-orthogonal):

e(0) = Σ_{i=0}^{n−1} δ_i d(i)

d^T_(j) A e(0) = Σ_{i=0}^{n−1} δ_i d^T_(j) A d(i) = δ_j d^T_(j) A d(j)

δ_j = d^T_(j) A e(0) / (d^T_(j) A d(j))
    = d^T_(j) A (e(0) − Σ_{k=0}^{j−1} δ_k d(k)) / (d^T_(j) A d(j))    (the added terms vanish by A-orthogonality)
    = d^T_(j) A e(j) / (d^T_(j) A d(j))
    = −α(j)

So each optimal step along d(j) eliminates exactly the δ_j component of the error; after n steps, e(n) = 0 and x(n) = x*.
Quiz
- Given that d’s are A-orthogonal, prove that d’s
are linearly independent.
Search directions
- We know how to determine the optimal step size
along each direction (second problem solved)
- We still need to figure out what the search directions are
- What do we know about d(0), d(1), ..., d(n−1)?
  - They are A-orthogonal to each other: d^T_(i) A d(j) = 0 for i ≠ j
  - d(i) is A-orthogonal to e(i+1)
Gram-Schmidt Conjugation
Suppose we have a set of linearly independent vectors u's; the search directions can be represented as

d(0) = u_0   and   d(k) = u_k + Σ_{i=0}^{k−1} β_ki d(i)

Use the same trick to get rid of the summation: for k > j,

d^T_(k) A d(j) = u^T_(k) A d(j) + β_kj d^T_(j) A d(j) = 0

β_kj = − u^T_k A d(j) / (d^T_(j) A d(j))

What are the drawbacks of Gram-Schmidt conjugation?
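A small numpy sketch of Gram-Schmidt conjugation (illustrative names); it also makes the drawback visible: every new direction needs all previous directions to be stored and reused:

    import numpy as np

    def gram_schmidt_conjugate(U, A):
        """Turn linearly independent columns of U into mutually A-orthogonal directions."""
        n = U.shape[1]
        D = np.zeros_like(U, dtype=float)
        for k in range(n):
            d = U[:, k].astype(float)
            for j in range(k):                  # loops over every previous direction
                beta_kj = -(U[:, k] @ (A @ D[:, j])) / (D[:, j] @ (A @ D[:, j]))
                d = d + beta_kj * D[:, j]
            D[:, k] = d
        return D

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    D = gram_schmidt_conjugate(np.eye(2), A)
    print(D[:, 0] @ (A @ D[:, 1]))              # ~0: the two directions are A-orthogonal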
Conjugate gradients
- If we pick the set of u's intelligently, we might be able to save both time and space
- It turns out that the residuals (r's) are an excellent choice for the u's:
  - the residuals are orthogonal to each other
  - each residual is orthogonal to the previous search directions
Conjugate gradients
d(k) = r(k) + Σ_{i=0}^{k−1} β_ki d(i)

For j < k,

d^T_(k) A d(j) = r^T_(k) A d(j) + Σ_{i=0}^{k−1} β_ki d^T_(i) A d(j)

0 = r^T_(k) A d(j) + β_kj d^T_(j) A d(j)    (by A-orthogonality of the d vectors)

β_kj = − r^T_(k) A d(j) / (d^T_(j) A d(j))

Each d(k) requires O(n^3) operations! However...
Conjugate gradients
r(k) is A-orthogonal to all the previous search directions except for d(k−1):

β_kj = − r^T_(k) A d(j) / (d^T_(j) A d(j)) = 0        if j < k − 1

β_kj = r^T_(k) r(k) / (r^T_(k−1) r(k−1))              if j = k − 1

Proof (next slides): r^T_(k) A d(j) = 0 when j < k − 1.
Proof: Orthogonality
Proof: r(k) is orthogonal to all the previous search directions d(0), d(1), ..., d(k−1).

e(k) = Σ_{j=k}^{n−1} δ_j d(j)

d^T_(i) A e(k) = Σ_{j=k}^{n−1} δ_j d^T_(i) A d(j) = 0   if i < k

so d^T_(i) r(k) = 0   if i < k

From here, we can prove:
identity 1:  r^T_(i) r(j) = 0 for i ≠ j
identity 2:  d^T_(k) r(k) = r^T_(k) r(k)
Proof: A-orthogonality
Proof: r(k) is A-orthogonal to all the previous search directions except for d(k−1).

r(j+1) = −A e(j+1) = −A(e(j) + α(j) d(j)) = r(j) − α(j) A d(j)

r^T_(k) r(j+1) = r^T_(k) r(j) − α(j) r^T_(k) A d(j)

Use identity 1 (the residuals are mutually orthogonal):

r^T_(k) A d(j) =  −r^T_(k) r(k) / α(k−1)   if j = k − 1
                   r^T_(k) r(k) / α(k)      if j = k
                   0                        otherwise
Conjugate gradients
Simplify β_k:

β_k = − r^T_(k) A d(k−1) / (d^T_(k−1) A d(k−1))
    = (r^T_(k) r(k) / α(k−1)) / (d^T_(k−1) A d(k−1))     (use identity 1, via the previous slide)
    = r^T_(k) r(k) / (d^T_(k−1) (r(k−1) − r(k)))
    = r^T_(k) r(k) / (d^T_(k−1) r(k−1))                  (use orthogonality: d^T_(k−1) r(k) = 0)
    = r^T_(k) r(k) / (r^T_(k−1) r(k−1))                  (use identity 2)
Conjugate gradients
Put it all together:

d(0) = r(0) = b − A x(0)

α(k) = r^T_(k) r(k) / (d^T_(k) A d(k))
x(k+1) = x(k) + α(k) d(k)
r(k+1) = r(k) − α(k) A d(k)
β(k+1) = r^T_(k+1) r(k+1) / (r^T_(k) r(k))
d(k+1) = r(k+1) + β(k+1) d(k)
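A direct numpy sketch of the assembled conjugate gradient method; the variable names and stopping test are illustrative:

    import numpy as np

    def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
        """Solve Ax = b for symmetric positive-definite A with the conjugate gradient method."""
        x = np.zeros_like(b, dtype=float) if x0 is None else np.asarray(x0, dtype=float)
        r = b - A @ x                      # d(0) = r(0) = b - A x(0)
        d = r.copy()
        rr = r @ r
        for _ in range(len(b) if max_iter is None else max_iter):
            if np.sqrt(rr) < tol:
                break
            Ad = A @ d
            alpha = rr / (d @ Ad)          # alpha(k) = r^T r / (d^T A d)
            x = x + alpha * d
            r = r - alpha * Ad             # r(k+1) = r(k) - alpha(k) A d(k)
            rr_new = r @ r
            beta = rr_new / rr             # beta(k+1) = r(k+1)^T r(k+1) / (r(k)^T r(k))
            d = r + beta * d               # d(k+1) = r(k+1) + beta(k+1) d(k)
            rr = rr_new
        return x

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    print(conjugate_gradient(A, b))        # converges in at most n steps; matches np.linalg.solve(A, b)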
References
- J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- A. Antoniou and W.-S. Lu, Practical Optimization
- R. Fletcher, Practical Methods of Optimization
- J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming