Optimization: Unconstrained and Constrained Optimization

SLIDE 1

Optimization

SLIDE 2

Constrained optimization
  • Newton with equality constraints
  • Simplex method
  • Interior-point method
  • Active-set method

Unconstrained optimization
  • One-dimensional: Newton’s method
  • Multi-dimensional: descent methods (gradient descent, conjugate gradient) and Newton-type methods (basic Newton, Gauss-Newton, quasi-Newton)
SLIDE 3

Unconstrained optimization

  • Define an objective function over a domain: f: Rn → R
  • Optimization variables: xT = {x1, x2, · · · , xn}

minimize f(x1, x2, · · · , xn), i.e. minimize f(x), for x ∈ Rn
SLIDE 4

Constraints

  • Equality constraints: ai(x) = 0 for x ∈ Rn, where i = 1, · · · , p
  • Inequality constraints: cj(x) ≥ 0 for x ∈ Rn, where j = 1, · · · , q
SLIDE 5

Constrained optimization

  • Solution: x* satisfies the constraints ai and cj while minimizing the objective function f(x)

minimize f(x), for x ∈ Rn
subject to ai(x) = 0, where i = 1, · · · , p
           cj(x) ≥ 0, where j = 1, · · · , q
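As an illustration of this standard form, the sketch below hands a small problem to SciPy's general-purpose solver; the objective, the constraint functions, and the starting point are assumptions chosen only for the example, not taken from the slides.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical objective: f(x) = (x1 - 1)^2 + (x2 - 2)^2
    def f(x):
        return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

    constraints = [
        {"type": "eq",   "fun": lambda x: x[0] + x[1] - 2.0},  # a(x) = 0
        {"type": "ineq", "fun": lambda x: x[0]},               # c(x) >= 0
    ]

    result = minimize(f, x0=np.array([0.0, 0.0]), constraints=constraints)
    print(result.x)  # constrained minimizer x*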
SLIDE 6

Formulating an optimization problem

  • A general optimization problem is very difficult to solve
  • Certain problem classes can be solved efficiently and reliably
  • Convex problems can be solved efficiently and reliably, with globally optimal solutions
  • Nonconvex problems do not guarantee globally optimal solutions
SLIDE 7

Example: pattern matching

  • A pattern can be described by a set of points, P = {p1, p2, ..., pn}
  • The same object viewed from a different distance or a different angle corresponds to a different pattern P′
  • Two patterns P and P′ are similar if, for some scale η, rotation θ, and translation (r1, r2),

p′i = η R(θ) pi + r,   where R(θ) = [cos θ  −sin θ; sin θ  cos θ] and r = [r1, r2]ᵀ
SLIDE 8

Example: pattern matching

  • Let Q = {q1, q2, ..., qn} be the target pattern; find the most similar pattern among P1, P2, ..., Pn
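One way to make this concrete, as a sketch only: fit the transform parameters (η, θ, r1, r2) from the previous slide to each candidate pattern by nonlinear least squares and keep the candidate with the smallest residual. The function names and the use of scipy.optimize.least_squares are illustrative assumptions.

    import numpy as np
    from scipy.optimize import least_squares

    def residuals(params, P, Q):
        # params = (eta, theta, r1, r2): scale, rotation angle, translation
        eta, theta, r1, r2 = params
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        P_transformed = eta * P @ R.T + np.array([r1, r2])  # p'_i = eta R p_i + r
        return (P_transformed - Q).ravel()

    def match_score(P, Q):
        # Smallest sum of squared residuals over all (eta, theta, r1, r2)
        fit = least_squares(residuals, x0=[1.0, 0.0, 0.0, 0.0], args=(P, Q))
        return np.sum(fit.fun ** 2)

    # The candidate P_k with the smallest match_score(P_k, Q) is the most similar to Q.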
SLIDE 9

Inverse kinematics

Map a set of 3D marker positions to a pose described by joint angles.
SLIDE 10

Optimal motion trajectories

SLIDE 11

Quiz

  • Arrive at d with velocity = 0; maximal force allowed: F
  • Minimize time? Minimize energy?
SLIDE 12
  • Unconstrained optimization
  • Newton method
  • Gauss-Newton method
  • Gradient descent method
  • Conjugate gradient method
SLIDE 13

Newton method

Find the roots of a nonlinear function:

C(x) = 0

We can linearize the function as

C(x̄) = C(x) + C′(x)(x̄ − x) = 0, where C′(x) = ∂C/∂x

Then we can estimate the root as

x̄ = x − C(x)/C′(x)
SLIDE 14

Root estimation

[Figure: successive estimates x(0), x(1), x(2) along the curve C(x)]

C(x(1)) = C(x(0)) + C′(x(0))(x(1) − x(0))
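A minimal sketch of this root-finding iteration; the example function, its derivative, and the stopping tolerance are illustrative assumptions.

    def newton_root(C, C_prime, x0, tol=1e-10, max_iter=50):
        # Repeated linearization: x <- x - C(x) / C'(x)
        x = x0
        for _ in range(max_iter):
            step = C(x) / C_prime(x)
            x -= step
            if abs(step) < tol:
                break
        return x

    # Example: root of C(x) = x^2 - 2 starting from x(0) = 1
    print(newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))  # ~1.414213562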
SLIDE 15

Minimization

Find the minima of F(x): find x∗ such that the nonlinear function attains a minimum at F(x∗).

What is the simplest function that has minima? A quadratic, so approximate F around x(k):

F(x(k) + δ) = F(x(k)) + F′(x(k))δ + (1/2)F″(x(k))δ²

Setting ∂F(x(k) + δ)/∂δ = 0 gives

δ = −F′(x(k))/F″(x(k))

i.e. we find the roots of F′(x).
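The same idea as a one-dimensional minimizer, applying the update δ = −F′/F″ until it stops changing; the example function and tolerance are assumptions for illustration.

    def newton_minimize_1d(F_prime, F_double_prime, x0, tol=1e-10, max_iter=50):
        # Newton's root finding applied to F'(x) = 0
        x = x0
        for _ in range(max_iter):
            delta = -F_prime(x) / F_double_prime(x)
            x += delta
            if abs(delta) < tol:
                break
        return x

    # Example: F(x) = (x - 3)^2 + 1, so F'(x) = 2(x - 3) and F''(x) = 2
    print(newton_minimize_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0))  # ~3.0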
SLIDE 16

Conditions

  • What are the conditions for minima to exist?
  • Necessary conditions for a local minimum at x∗: F′(x∗) = 0 and F″(x∗) ≥ 0
  • Sufficient conditions for an isolated minimum at x∗: F′(x∗) = 0 and F″(x∗) > 0
SLIDE 17

Minimization

[Figure: F(x) and F′(x) near the minimizer x∗, where F′(x∗) = 0 and F″(x∗) > 0]
SLIDE 18

Multidimensional optimization

  • Search methods only need function evaluations
  • First-order gradient-based methods depend on the information of the gradient g
  • Second-order gradient-based methods depend on both the gradient g and the Hessian H
SLIDE 19

Multiple variables

F(x(k) + p) = F(x(k)) + gᵀ(x(k))p + (1/2)pᵀH(x(k))p

gradient vector:  g(x) = ∇xF = [∂F/∂x1, . . . , ∂F/∂xn]ᵀ

Hessian matrix:  H(x) = ∇²xxF, with entries [H(x)]ij = ∂²F/∂xi∂xj
SLIDE 20

Multiple variables

0 = g(x(k)) + H(x(k))p

p = −H(x(k))⁻¹g(x(k))

x(k+1) = x(k) + p
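A minimal sketch of this multidimensional Newton iteration; the gradient and Hessian are supplied as callables, and the convergence test is an illustrative assumption.

    import numpy as np

    def newton_minimize(x0, grad, hess, tol=1e-10, max_iter=50):
        x = x0.copy()
        for _ in range(max_iter):
            g = grad(x)
            H = hess(x)
            p = np.linalg.solve(H, -g)   # solve H p = -g instead of forming H^{-1}
            x = x + p
            if np.linalg.norm(p) < tol:
                break
        return x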
SLIDE 21

Multiple variables

  • Necessary conditions: g(x∗) = 0 and pᵀH∗p ≥ 0 (H∗ is positive semi-definite)
  • Sufficient conditions: g(x∗) = 0 and pᵀH∗p > 0 (H∗ is positive definite)
SLIDE 22

Gauss-Newton method

  • What if the objective function is in the form of a vector of functions f = [f1(x) f2(x) · · · fm(x)]ᵀ?
  • The real-valued objective function can be formed as

F = Σp=1..m fp(x)² = fᵀf
SLIDE 23

Jacobian

  • Each fp(x) depends on xi for i = 1, 2, ..., n, so a gradient matrix (the Jacobian J, with entries Jpi = ∂fp/∂xi) can be formed
  • The Jacobian need not be a square matrix
SLIDE 24

Gradient and Hessian

  • Gradient of objective function
  • Hessian of objective function

∂F/∂xi = Σp=1..m 2fp(x) ∂fp/∂xi,   so gF = 2Jᵀf

∂²F/∂xi∂xj = 2 Σp=1..m (∂fp/∂xi)(∂fp/∂xj) + 2 Σp=1..m fp(x) ∂²fp/∂xi∂xj,   so HF ≈ 2JᵀJ (dropping the second-derivative term)
SLIDE 25

Gauss-Newton algorithm

  • In the kth iteration, compute fp(xk) and Jk to obtain the new gk and Hk
  • Compute pk = −(2JᵀJ)⁻¹(2Jᵀf) = −(JᵀJ)⁻¹(Jᵀf)
  • Find αk that minimizes F(xk + αkpk)
  • Set xk+1 = xk + αkpk (see the sketch below)
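A compact sketch of these Gauss-Newton steps; the fixed step size αk = 1 is an assumption for brevity (a line search over αk could be added as the slide suggests), and f and jac are user-supplied callables.

    import numpy as np

    def gauss_newton(f, jac, x0, tol=1e-10, max_iter=50):
        # Minimize F(x) = f(x)^T f(x) using p = -(J^T J)^{-1} J^T f
        x = x0.copy()
        for _ in range(max_iter):
            r = f(x)      # residual vector f(x_k), length m
            J = jac(x)    # Jacobian J_k, shape (m, n)
            p = np.linalg.solve(J.T @ J, -J.T @ r)
            x = x + p     # alpha_k = 1 assumed here
            if np.linalg.norm(p) < tol:
                break
        return x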
SLIDE 26
  • First-order gradient methods
  • Greatest gradient descent
  • Conjugate gradient
SLIDE 27

Solving large linear systems

Ax = b

  • A: a known, square, symmetric, positive semi-definite matrix
  • b: a known vector
  • x: an unknown vector

If A is dense, solve with factorization and back substitution.
If A is sparse, solve with iterative methods (descent methods).
SLIDE 28

Quadratic form

F(x) = (1/2)xᵀAx − bᵀx + c

The gradient of F(x) is

F′(x) = (1/2)Aᵀx + (1/2)Ax − b

If A is symmetric, F′(x) = Ax − b, so the critical point of F is also the solution to Ax = b:

F′(x) = 0 = Ax − b

If A is not symmetric, what is the linear system solved by finding the critical points of F?
SLIDE 29

Greatest gradient descent

Take a step along the direction in which F descends most quickly:

−F′(x(k)) = b − Ax(k)

Start at an arbitrary point x(0) and slide down to the bottom of the paraboloid. Take a series of steps x(1), x(2), ... until we are satisfied that we are close enough to the solution x*.
SLIDE 30

Greatest gradient descent

Important definitions:

error: e(k) = x(k) − x∗
residual: r(k) = b − Ax(k) = −F′(x(k)) = −Ae(k)

Think of the residual as the direction of greatest descent.
SLIDE 31

Line search

x(1) = x(0) + αr(0)

But how big a step should we take? A line search is a procedure that chooses α to minimize F along a line.
SLIDE 32

Line search

[Figure: line search illustration]
SLIDE 33

Optimal step size

d/dα F(x(1)) = F′(x(1))ᵀ d/dα x(1) = F′(x(1))ᵀ r(0) = 0

i.e. F′(x(1)) is orthogonal to r(0), so

rᵀ(0)r(1) = 0
SLIDE 34

Optimal step size

Exercise: derive α from

rᵀ(k)r(k+1) = 0

Hint: replace the terms involving (k+1) with those involving (k) using x(k+1) = x(k) + αr(k).

Ans:

α = rᵀ(k)r(k) / rᵀ(k)Ar(k)
SLIDE 35

Recurrence of residual

The algorithm requires two matrix-vector multiplications per iteration:

1. r(k) = b − Ax(k)
2. α = rᵀ(k)r(k) / rᵀ(k)Ar(k)
3. x(k+1) = x(k) + αr(k)

One multiplication can be eliminated by replacing step 1 with

r(k+1) = r(k) − αAr(k)
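A minimal sketch of greatest gradient descent with the residual recurrence above; A r(k) is computed once per iteration and reused for both α and the residual update. The tolerance and iteration cap are illustrative assumptions.

    import numpy as np

    def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
        # Solve Ax = b (A symmetric positive definite) by greatest gradient descent
        x = x0.copy()
        r = b - A @ x                    # initial residual r(0)
        for _ in range(max_iter):
            Ar = A @ r                   # the single matrix-vector product per iteration
            alpha = (r @ r) / (r @ Ar)   # optimal step size along r
            x = x + alpha * r
            r = r - alpha * Ar           # residual recurrence instead of r = b - A x
            if np.linalg.norm(r) < tol:
                break
        return x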
SLIDE 36

Quiz

  • In our IK problem, we use the greatest gradient descent method to find an optimal pose, but we can’t compute α using the formula described in the previous slides. Why?
SLIDE 37

Line search

  • Exact line search: choose t to minimize f along the ray {x + t∆x | t ≥ 0}:

    t = argmin(s≥0) f(x + s∆x)

  • Backtracking line search depends on two constants α and β:

    given a descent direction ∆x for f at x ∈ dom f, α ∈ (0, 0.5), β ∈ (0, 1)
    t := 1
    while f(x + t∆x) > f(x) + αt∇f(x)ᵀ∆x:  t := βt
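A direct sketch of the backtracking procedure above; the default values of α and β are common illustrative choices, not prescribed by the slide.

    import numpy as np

    def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
        # Shrink t until the sufficient-decrease condition holds
        t = 1.0
        fx = f(x)
        slope = grad_f(x) @ dx   # directional derivative along dx (negative for a descent direction)
        while f(x + t * dx) > fx + alpha * t * slope:
            t *= beta
        return t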
SLIDE 38

Poor convergence

What is the problem with greatest gradient descent? Wouldn’t it be nice if we could avoid traversing the same direction more than once?
SLIDE 39

Conjugate directions

Pick a set of directions:

d(0), d(1), · · · , d(n−1)

Take exactly one step along each direction; the solution is found within n steps.

Two problems:
  • 1. How do we determine these directions?
  • 2. How do we determine the step size along each direction?
SLIDE 40

A-orthogonality

If we take the optimal step size along each direction

d/dα F(x(k+1)) = F′(x(k+1))ᵀ d/dα x(k+1) = −rᵀ(k+1)d(k) = dᵀ(k)Ae(k+1) = 0

Two different vectors v and u are A-orthogonal, or conjugate, if vᵀAu = 0.

SLIDE 41

A-orthogonality

[Figure: pairs of vectors that are A-orthogonal vs. pairs that are orthogonal]
SLIDE 42

Optimal step size

e(k+1) must be A-orthogonal to d(k).

Using this condition, can you derive α(k)?
SLIDE 43

Algorithm

Suppose we can come up with a set of A-orthogonal directions {d(k)}; this algorithm will converge in n steps:

1. Take the direction d(k)
2. α(k) = dᵀ(k)r(k) / dᵀ(k)Ad(k)
3. x(k+1) = x(k) + α(k)d(k)
SLIDE 44

Why does it work?

We need to prove that x∗ can be found in n steps if we take step size α(k) along d(k) at each step.

Because the d’s are linearly independent (which holds if the d’s are A-orthogonal), the initial error can be expanded as

e(0) = Σi=0..n−1 δi d(i)

Multiplying by dᵀ(j)A and using A-orthogonality,

dᵀ(j)Ae(0) = Σi=0..n−1 δi dᵀ(j)Ad(i) = δj dᵀ(j)Ad(j)

so

δj = dᵀ(j)Ae(0) / dᵀ(j)Ad(j)
   = dᵀ(j)A(e(0) − Σk=0..j−1 δk d(k)) / dᵀ(j)Ad(j)
   = dᵀ(j)Ae(j) / dᵀ(j)Ad(j)
   = −α(j)

Each step along d(j) therefore cancels exactly the δj component of the error, so e(n) = 0.
SLIDE 45

Quiz

  • Given that the d’s are A-orthogonal, prove that the d’s are linearly independent.
SLIDE 46

Search directions

  • We know how to determine the optimal step size along each direction (the second problem is solved)
  • We still need to figure out what the search directions are
  • What do we know about d(0), d(1), ..., d(n−1)?
  • They are A-orthogonal to each other: dᵀ(i)Ad(j) = 0 for i ≠ j
  • d(i) is A-orthogonal to e(i+1)
SLIDE 47

Gram-Schmidt Conjugation

Suppose we have a set of linearly independent vectors u’s; the search directions can be represented as

d(0) = u0   and   d(k) = uk + Σi=0..k−1 βki d(i)

Use the same trick to get rid of the summation: for k > j,

dᵀ(k)Ad(j) = uᵀ(k)Ad(j) + βkj dᵀ(j)Ad(j) = 0

βkj = −uᵀ(k)Ad(j) / dᵀ(j)Ad(j)

What are the drawbacks of Gram-Schmidt conjugation?
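A direct sketch of Gram-Schmidt conjugation for given linearly independent vectors u; it has to store and revisit every previous direction, which is the cost drawback the slide asks about.

    import numpy as np

    def gram_schmidt_conjugate(U, A):
        # Rows of U are the linearly independent u's; returns A-orthogonal directions d
        D = []
        for k in range(U.shape[0]):
            d = U[k].copy()
            for j in range(k):
                beta_kj = -(U[k] @ A @ D[j]) / (D[j] @ A @ D[j])
                d = d + beta_kj * D[j]
            D.append(d)
        return np.array(D)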
SLIDE 48

Conjugate gradients

  • If we pick a set of u’s intelligently, we might be able to save both time and space
  • It turns out that the residuals (r’s) are an excellent choice for the u’s
  • Residuals are orthogonal to each other
  • Each residual is orthogonal to the previous search directions
SLIDE 49

Conjugate gradients

d(k) = r(k) + Σi=0..k−1 βki d(i)

For j < k,

dᵀ(k)Ad(j) = rᵀ(k)Ad(j) + Σi=0..k−1 βki dᵀ(i)Ad(j)

0 = rᵀ(k)Ad(j) + βkj dᵀ(j)Ad(j)   (by A-orthogonality of the d vectors)

βkj = −rᵀ(k)Ad(j) / dᵀ(j)Ad(j)

Each d(k) requires O(n³) operations! However...
SLIDE 50

Conjugate gradients

r(k) is A-orthogonal to all the previous search directions except for d(k−1):

βkj = −rᵀ(k)Ad(j) / dᵀ(j)Ad(j) = 0,   if j < k − 1

βkj = rᵀ(k)r(k) / rᵀ(k−1)r(k−1),   if j = k − 1

Proof: rᵀ(k)Ad(j) = 0 when j < k − 1.
SLIDE 51

Proof: Orthogonality

Proof: r(k) is orthogonal to all the previous search directions d(0), d(1), · · · , d(k−1).

e(k) = Σj=k..n−1 δj d(j)

dᵀ(i)Ae(k) = Σj=k..n−1 δj dᵀ(i)Ad(j) = 0   if i < k

so dᵀ(i)r(k) = 0 if i < k

From here, we can prove

rᵀ(i)r(j) = 0, i ≠ j   (identity 1)

dᵀ(k)r(k) = rᵀ(k)r(k)   (identity 2)
SLIDE 52

Proof: A-orthogonality

Proof: r(k) is A-orthogonal to all the previous search directions except for d(k−1).

r(j+1) = −Ae(j+1) = −A(e(j) + α(j)d(j)) = r(j) − α(j)Ad(j)

rᵀ(k)r(j+1) = rᵀ(k)r(j) − α(j)rᵀ(k)Ad(j)

Using identity 1 (rᵀ(k)r(j) = 0 for j ≠ k),

rᵀ(k)Ad(j) = rᵀ(k)r(k)/α(k),   if j = k
rᵀ(k)Ad(j) = −rᵀ(k)r(k)/α(k−1),   if j = k − 1
rᵀ(k)Ad(j) = 0,   otherwise
SLIDE 53

Conjugate gradients

Simplify βk:

βk = −rᵀ(k)Ad(k−1) / dᵀ(k−1)Ad(k−1)

   = rᵀ(k)r(k) / (α(k−1) dᵀ(k−1)Ad(k−1))        (using the j = k − 1 case above)

   = rᵀ(k)r(k) / (dᵀ(k−1)(r(k−1) − r(k)))        (since α(k−1)Ad(k−1) = r(k−1) − r(k))

   = rᵀ(k)r(k) / dᵀ(k−1)r(k−1)                   (use orthogonality: dᵀ(k−1)r(k) = 0)

   = rᵀ(k)r(k) / rᵀ(k−1)r(k−1)                   (use identity 2)
SLIDE 54

Conjugate gradients

Put it all together:

d(0) = r(0) = b − Ax(0)

α(k) = rᵀ(k)r(k) / dᵀ(k)Ad(k)

x(k+1) = x(k) + α(k)d(k)

r(k+1) = r(k) − α(k)Ad(k)

β(k+1) = rᵀ(k+1)r(k+1) / rᵀ(k)r(k)

d(k+1) = r(k+1) + β(k+1)d(k)
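A minimal sketch of the assembled conjugate gradient iteration; A is assumed symmetric positive definite, and the tolerance and iteration cap are illustrative assumptions.

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
        n = b.shape[0]
        max_iter = max_iter or n          # converges in at most n steps in exact arithmetic
        x = x0.copy()
        r = b - A @ x                     # r(0)
        d = r.copy()                      # d(0) = r(0)
        rs_old = r @ r
        for _ in range(max_iter):
            Ad = A @ d
            alpha = rs_old / (d @ Ad)     # alpha(k) = r(k)^T r(k) / d(k)^T A d(k)
            x = x + alpha * d
            r = r - alpha * Ad            # r(k+1) = r(k) - alpha(k) A d(k)
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            beta = rs_new / rs_old        # beta(k+1) = r(k+1)^T r(k+1) / r(k)^T r(k)
            d = r + beta * d              # d(k+1) = r(k+1) + beta(k+1) d(k)
            rs_old = rs_new
        return x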
SLIDE 55

References

  • J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
  • A. Antoniou and W.S. Lu, Practical Optimization
  • R. Fletcher, Practical Methods of Optimization
  • J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming